Foreword



This volume contains the revised versions of papers presented at the fourth 
international Workshop on Implementing Automata (WIA), held 17-19 July, 
1999, at Potsdam University, Germany. 

As with its predecessors, the theme of WIA99 was the implementation of
automata and grammars of all types and their application in other fields. The
papers contributed to this volume address, among other topics, algorithmic
issues regarding automata, image and dictionary storage by automata, and
natural language processing.

In addition to the papers presented in these proceedings, the workshop
included a paper on quantum computing by C. Calude, E. Calude, and K. Svozil
(published elsewhere), an invited lecture by W. Thomas on Algorithmic
Problems in the Theory of ω-Automata, a tutorial by M. Silberztein on the INTEX
linguistic development environment, and several demonstrations of systems.

The local arrangements for WIA99 were conducted by Helmut Jürgensen,
Suna Aydin, Oliver Boldt, Carsten Haustein, Beatrice Mix, and Lynda Robbins.
The meeting was held in the Communs building, now the main university
building, of the New Palace in the park of Sanssouci, Potsdam.
Technische Universität München
Université de Rouen
Universität München
Universität Potsdam and University of Western Ontario
Université de Tours
Rheinisch-Westfälische Technische Hochschule Aachen
The work of the program committee, the reviewers, and the local
arrangements team is gratefully acknowledged.

At the general WIA meeting it was decided to rename WIA to International
Conference on the Implementation and Application of Automata (CIAA) and to
hold the first CIAA, that is, the fifth WIA, in London, Ontario, Canada, in the
summer of 2000 in conjunction with the Workshop on Descriptional Complexity
of Automata, Grammars, and Related Structures (DCAGRS) and a special day
devoted to the 50th anniversary of automaton theory. The complete event would
be called Half a Century of Automaton Theory.
H. Jürgensen
Abstract. In this paper, we deal with the minimization of finite automata
associated with finite languages in which all the words have the same length.
This problem arises in the context of Constraint Satisfaction Problems, widely
used in AI. We first give some complexity results which are based on the
strong relationship with covering problems of bipartite graphs. We then
use these coverings as a basic tool for the definition of minimization
heuristics, and describe some experimental results.



1 Motivations 

Many AI problems can be expressed as Constraint Satisfaction Problems, or CSP
for short [Mon74]. A CSP involves a finite set of variables, a finite set of values
for the variables and a set of constraints. Each constraint is defined as a relation
on some subset of variables and gives the values which are mutually compatible
for these variables. A solution is an assignment of values to the variables that
satisfies all the constraints. Most work on CSPs deals with the problem of
computing one solution. Nevertheless, in some applications (e.g. design
problems), it is necessary to compute and represent all the solutions.
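To make the notion of "all the solutions" concrete, here is a minimal brute-force sketch. The toy CSP (variables x, y, z, the constraints x < y and y != z) and the function name `all_solutions` are illustrative assumptions, not taken from the paper. Once the variables are ordered, every solution is a word of the same length over the value set.

```python
from itertools import product

def all_solutions(variables, domain, constraints):
    """Enumerate every assignment that satisfies all constraints.

    constraints: list of (scope, relation) pairs, where scope is a tuple of
    variable names and relation is the set of allowed value tuples.
    """
    solutions = []
    for values in product(domain, repeat=len(variables)):
        assignment = dict(zip(variables, values))
        if all(tuple(assignment[v] for v in scope) in relation
               for scope, relation in constraints):
            solutions.append(values)
    return solutions

# Hypothetical toy CSP: x < y and y != z over the values {1, 2, 3}.
variables = ("x", "y", "z")
domain = (1, 2, 3)
lt = {(a, b) for a in domain for b in domain if a < b}
ne = {(a, b) for a in domain for b in domain if a != b}
solutions = all_solutions(variables, domain,
                          [(("x", "y"), lt), (("y", "z"), ne)])
```

Each element of `solutions` is a word of length 3, so the solution set is a finite language of equal-length words.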

In this paper we address the issue of representing and computing all solutions.
One approach to this problem, proposed by Vempaty [Vem92], is to use Finite
Automata (FA)¹. Given a permutation of the variable set, the set of solutions
appears as a regular language and can be represented by its Minimal
Deterministic Finite Automaton (MDFA). Solution-set languages (Ln) are finite
sets of words of equal length. Using FA allows incremental construction of the
solution set by applying classical operations on the FA associated with each
constraint. The efficiency of this method depends on the size of the
intermediate MDFA. Computation of MDFA recognizing finite languages has been
studied in [Rev92, DWW98]. In this paper, we propose a more compact
representation: Nondeterministic Finite Automata (NFA). It is well known that
an NFA may be exponentially more compact than the equivalent DFA; this property
is preserved on the Ln class. However, NFA minimization is a harder problem.
The only studies on this problem concern the general case [Kim74, MF95, KW70].
In this paper we study this problem on the language class Ln.

¹ A similar approach using OBDDs has been used to represent Boolean functions.
2 Definitions

In the rest of this paper, we focus on FA recognizing Ln languages. We will use 
the following notations. 

A finite state automaton $A$ is a quintuple $(Q, \Sigma, \delta, I, F)$ where $Q$ is a set of
states, $\delta$ is the transition function $Q \times \Sigma \to 2^Q$, and $I$ and $F$ are respectively
the start and the accepting states. We consider only automata with a unique final
state since such automata are sufficient to recognize Ln languages.

The right language (resp. left language) of a state $q$ is
$\mathcal{L}_D(A, q) = \{m \in \Sigma^* \mid F \in \delta^*(q, m)\}$
(resp. $\mathcal{L}_G(A, q) = \{m \in \Sigma^* \mid q \in \delta^*(I, m)\}$).
In what follows, we will suppose that all states are accessible (their left
language is not empty) and coaccessible (their right language is not empty).
To denote the transition function, we will also use
$\gamma^D_A(q) = \{(d, q') \mid q' \in \delta(q, d)\}$ and
$\gamma^G_A(q) = \{(q', d) \mid q \in \delta(q', d)\}$.

Automata recognizing Ln languages have special properties. First, since Ln
languages are finite, they are acyclic. Moreover, since all words are of equal
length, the set $Q$ can be decomposed into levels. For a given state $q$, all the
words of its left language have the same length $i$. We will say that $i$ is the
level of $q$ and denote by $\mathcal{N}_A(i)$ the set of states on level $i$. An
automaton recognizing an Ln language $L \subseteq \Sigma^n$ has $n + 1$ levels and is such
that $\mathcal{N}_A(0) = \{I\}$ and $\mathcal{N}_A(n) = \{F\}$.
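The level decomposition can be computed by a breadth-first traversal from the start state. The following is a minimal sketch; the dictionary encoding of the transition function and the example automaton are assumptions made for illustration only.

```python
from collections import deque

def levels(delta, start):
    """Group the states of a levelled acyclic FA by distance from the start.

    delta maps (state, letter) to the set of successor states.
    """
    level = {start: 0}
    queue = deque([start])
    while queue:
        q = queue.popleft()
        for (p, _letter), targets in delta.items():
            if p != q:
                continue
            for t in targets:
                if t not in level:
                    level[t] = level[q] + 1
                    queue.append(t)
    by_level = {}
    for q, i in level.items():
        by_level.setdefault(i, set()).add(q)
    return by_level

# Hypothetical FA for the language {ab, ba}: n = 2, hence three levels.
delta = {("I", "a"): {"q1"}, ("I", "b"): {"q2"},
         ("q1", "b"): {"F"}, ("q2", "a"): {"F"}}
by_level = levels(delta, "I")
```

Level 0 contains only the start state and level n only the final state, as stated above.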

This work deals with the minimization of nondeterministic FA (NFA) and
unambiguous FA (UFA). A UFA is an NFA in which there is a unique accepting
computation for every accepted string. The Flower Automaton is a distinguished
automaton among all UFA recognizing the same Ln language.

Definition 1. Let $L \subseteq \Sigma^n$. The Flower Automaton of $L$ is a UFA
$A = (Q, \Sigma, \delta, I, F)$ such that $\forall q \in Q \setminus \{I, F\}$,
$|\gamma^G_A(q)| = |\gamma^D_A(q)| = 1$. It is the biggest UFA recognizing $L$.
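Unambiguity can be checked directly on small acyclic automata by listing the labels of all accepting paths and looking for duplicates. This is only an illustrative sketch: the triple-based encoding of transitions and both example automata are assumptions, and the enumeration is exponential in general.

```python
def accepting_words(transitions, start, final):
    """Labels of all accepting paths of an acyclic FA, with repetitions.

    transitions: set of (source, letter, target) triples; the final state
    is assumed to have no outgoing transition.
    """
    words = []
    stack = [(start, "")]
    while stack:
        state, word = stack.pop()
        if state == final:
            words.append(word)
            continue
        for p, d, q in transitions:
            if p == state:
                stack.append((q, word + d))
    return words

def is_unambiguous(transitions, start, final):
    """True if no word labels two different accepting paths."""
    words = accepting_words(transitions, start, final)
    return len(words) == len(set(words))

# Two hypothetical automata for the single word "ab".
ambiguous = {("I", "a", "q1"), ("I", "a", "q2"),
             ("q1", "b", "F"), ("q2", "b", "F")}
unambiguous = {("I", "a", "q1"), ("q1", "b", "F")}
```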

The use of NFA is motivated by the fact that an NFA can be more compact than
the equivalent MDFA. For general languages, the MDFA equivalent to a $k$-state
NFA may have $2^k$ states. The following example shows that such a size difference
also exists between UFA and MDFA recognizing Ln languages.

Let $\Sigma = \{1, 2, \ldots, n\}$ and let $L_n \subseteq \Sigma^n$ be the language recognized by the
automaton $A_n$ defined below.

Consider the FA $A_n$ with the following $n + 1$ levels:

$\mathcal{N}_{A_n}(0) = \{q_{0,1}\}$, $\mathcal{N}_{A_n}(n) = \{q_{n,1}\}$,
$\mathcal{N}_{A_n}(i) = \{q_{i,1}, q_{i,2}, \ldots, q_{i,n}\}$ $(0 < i < n)$.

The transition function of $A_n$ is defined by:

$\forall i \in \{1, 2, \ldots, n-2\}$, $\forall p \in \{1, 2, \ldots, n\}$, $\forall j \in \Sigma \setminus \{p\}$,

$\delta(q_{0,1}, p) = q_{1,p}$, $\delta(q_{i,p}, j) = q_{i+1,p}$, $\delta(q_{n-1,p}, p) = q_{n,1}$.

Figure 1 shows $A_3$ and its equivalent MDFA. $A_n$ is a UFA recognizing $L_n$.
It has $n \times (n - 1) + 2$ states. For the equivalent MDFA, the number of states is
greater than $2^n$ since for each nonempty proper subset of $\Sigma$ there exists a state
on level $n - 1$ of the MDFA.

3 Complexity of FA Minimization for Ln Languages 

For general languages, FA minimization is PSPACE-hard. In order to study the
complexity of FA minimization for Ln languages, we introduce related problems
on bipartite graphs.
Fig. 1. $A_3$ and the equivalent MDFA



A bipartite graph $B$ is a finite, simple, undirected graph, given by
$(X_B, Y_B, E_B)$, where $X_B$ and $Y_B$ partition the vertices of $B$ into two
independent subsets, and $E_B$ denotes the edges of $B$.

A biclique $K$ of $B$ is a pair $(X, Y)$ such that $X \subseteq X_B$, $Y \subseteq Y_B$, $X$ and $Y$
are nonempty, and $B(X, Y)$, the subgraph induced by $(X, Y)$, is complete
($X \times Y \subseteq E_B$).

A biclique covering of $B$ is a set of bicliques
$R = \{(X_1, Y_1), (X_2, Y_2), \ldots, (X_k, Y_k)\}$ such that
$E_B = \bigcup_{i=1}^{k} X_i \times Y_i$.

A biclique cover is a biclique decomposition of $B$ if and only if any two distinct
bicliques $K_i$ and $K_j$ of $R$ have no common edge.

We denote by RPB (resp. DPB) the decision problem associated with the
computation of the minimum size of a biclique cover (resp. biclique
decomposition) of $B$.
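The two definitions translate directly into set operations. The sketch below represents a bipartite graph by its edge set of (x, y) pairs; the function names and the example graph are illustrative assumptions.

```python
def is_biclique_cover(edges, bicliques):
    """True if every X x Y lies inside `edges` and together they cover it."""
    covered = set()
    for xs, ys in bicliques:
        block = {(x, y) for x in xs for y in ys}
        if not xs or not ys or not block <= edges:
            return False
        covered |= block
    return covered == edges

def is_biclique_decomposition(edges, bicliques):
    """True if the cover is also edge-disjoint (each edge covered once)."""
    total = sum(len(xs) * len(ys) for xs, ys in bicliques)
    return is_biclique_cover(edges, bicliques) and total == len(edges)

# Hypothetical bipartite graph with X = {1, 2} and Y = {a, b, c}.
edges = {(1, "a"), (1, "b"), (2, "a"), (2, "b"), (2, "c")}
disjoint = [({1, 2}, {"a", "b"}), ({2}, {"c"})]
overlapping = [({1, 2}, {"a", "b"}), ({2}, {"a", "b", "c"})]
```

Here `disjoint` is a decomposition, whereas `overlapping` is a cover in which the edges (2, a) and (2, b) are covered twice.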

There is a strong relationship between biclique coverings of bipartite graphs
and FA when all the words have length 2. We can associate a bipartite graph
with each L2 language, and there is a bijection between the FA (resp. UFA)
recognizing this language and the biclique coverings (resp. decompositions) of
this graph. Figure 2 shows this relationship.
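The correspondence for words of length 2 can be sketched as follows: the edges of the bipartite graph are the (first letter, second letter) pairs, and each biclique of a cover becomes one middle state. The state names "I" and "F", the helper names and the example language are assumptions for illustration.

```python
def graph_of_language(words):
    """Edges (first letter, second letter) of a language of two-letter words."""
    return {(w[0], w[1]) for w in words}

def automaton_from_cover(bicliques):
    """Three-level FA with one middle state per biclique (X, Y)."""
    delta = {}
    for i, (xs, ys) in enumerate(bicliques):
        for x in xs:
            delta.setdefault(("I", x), set()).add(i)
        for y in ys:
            delta.setdefault((i, y), set()).add("F")
    return delta

def language(delta):
    """All two-letter words labelling a path from I to F."""
    words = set()
    for (p, x), mids in delta.items():
        if p != "I":
            continue
        for m in mids:
            for (q, y), targets in delta.items():
                if q == m and "F" in targets:
                    words.add(x + y)
    return words

words = {"aa", "ab", "ba", "bb", "bc"}
cover = [({"a", "b"}, {"a", "b"}), ({"b"}, {"c"})]
fa = automaton_from_cover(cover)
```

The Flower Automaton of this language would need five middle states (one per word); the cover above yields an equivalent FA with two.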




Fig. 2. Bipartite graph and corresponding Flower FA ; FA and corresponding biclique 
covering 
We denote by (Flower FA → FA)-n (resp. (Flower FA → UFA)-n) the problem of
constructing a minimum FA (resp. UFA) equivalent to a given Flower FA for Ln
languages. In the following lemma, we also use (Flower FA → FA)-n and
(Flower FA → UFA)-n for the associated decision problems.

Lemma 1. (Flower FA → FA)-2 (resp. (Flower FA → UFA)-2) is polynomially
equivalent to RPB (resp. DPB).

Proposition 1. The decision problems associated with (Flower FA → FA)-2
and (Flower FA → UFA)-2 are NP-complete.

Proof. NP-completeness for (Flower FA → FA)-2 follows from the previous lemma
and the fact that RPB has been proved NP-complete [Orl77]. The DPB problem has
also been proved NP-complete; it is an easy consequence of a result which states
that obtaining a minimal UFA equivalent to a given DFA is NP-complete.

Corollary 1. The minimization problems (FA → FA)-n and (FA → UFA)-n are
NP-hard.

4 FA Level Minimization Using Biclique Coverings 

Because of the close relationship between biclique coverings of bipartite graphs
and FA, we propose to use biclique coverings as a basic level minimization step.
Let us first define the bipartite graph $B_k$ that can be associated with each level
$k$ of a given FA $A$ and the FA $A/R_k$ that can be computed from $A$ and any
biclique covering $R_k$ of $B_k$.

Definition 2. Let $A = (Q, \Sigma, \delta, I, F)$ be a FA recognizing $L \subseteq \Sigma^n$ and $k$,
$0 < k < n$, be a level of $A$.

• The bipartite graph of level $k$ of $A$ is $B_k = (X_k, Y_k, E_k)$ with
$X_k = \mathcal{N}_A(k)$, $Y_k = \bigcup_{q \in \mathcal{N}_A(k)} \gamma^D_A(q)$ and
$E_k = \{\{x, y\} \mid x \in \mathcal{N}_A(k), y \in \gamma^D_A(x)\}$.

• Let $R_k = \{(V_1, W_1), (V_2, W_2), \ldots, (V_p, W_p)\}$ be a biclique covering of $B_k$.
Then $A/R_k$ denotes the FA obtained from $A$ by replacing the states of
$\mathcal{N}_A(k)$ with $\mathcal{N}_{A/R_k}(k) = \{V_i \mid (V_i, W_i) \in R_k\}$ and by updating the
transition function to have $\gamma^D_{A/R_k}(V_i) = W_i$ and
$\gamma^G_{A/R_k}(V_i) = \bigcup_{q \in V_i} \gamma^G_A(q)$.

An example of a FA $A$ and of $A/R_k$ is given in Fig. 3.

We can note that for a FA $A$ and a biclique covering $R_k$ of $B_k$, the FA $A/R_k$
has $|R_k|$ states on level $k$ while the other levels remain unchanged.
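The construction of $A/R_k$ can be sketched in a few lines. The encoding below (transitions as (source, letter, target) triples, new states named ("V", i)) is an assumption made for the example, not the authors' implementation.

```python
def quotient_level(transitions, level_states, cover):
    """Replace the states of one level by the bicliques of `cover`.

    transitions: set of (source, letter, target) triples.
    cover: list of (V, W), V a set of states of the level and W a set of
    outgoing (letter, target) pairs shared by every state of V.
    """
    # Transitions that do not touch the level are kept unchanged.
    result = {t for t in transitions
              if t[0] not in level_states and t[2] not in level_states}
    for i, (v, w) in enumerate(cover):
        new_state = ("V", i)
        # Incoming edges: union of the incoming edges of the members of V.
        for (p, d, q) in transitions:
            if q in v:
                result.add((p, d, new_state))
        # Outgoing edges: exactly W.
        for (d, q) in w:
            result.add((new_state, d, q))
    return result

# Hypothetical FA for {ac, bc}; both level-1 states have the same successors.
transitions = {("I", "a", "q1"), ("I", "b", "q2"),
               ("q1", "c", "F"), ("q2", "c", "F")}
cover = [({"q1", "q2"}, {("c", "F")})]
reduced = quotient_level(transitions, {"q1", "q2"}, cover)
```

The level now holds a single state, one per biclique of the cover, and the recognized language is still {ac, bc}.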

Let us introduce the following technical lemma from which some of the proofs 
in this paper can be directly obtained, as for example, the equivalence of A/Rk 
and A. 

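Lemma 2. Let $A$ be a FA recognizing $L \subseteq \Sigma^n$, $R_k$ be a biclique covering of $B_k$
and let $A'$ denote the FA $A/R_k$. We then have the following properties:

(i) $\forall V \in \mathcal{N}_{A'}(k)$, $\mathcal{L}_G(A', V) = \bigcup_{q \in V} \mathcal{L}_G(A, q)$

(ii) $\forall V \in \mathcal{N}_{A'}(k)$, $\mathcal{L}_D(A', V) \subseteq \bigcap_{q \in V} \mathcal{L}_D(A, q)$

(iii) $\forall i \in \{0, 1, \ldots, n\}$, $i \neq k$, $\forall q \in \mathcal{N}_{A'}(i)$, $\mathcal{L}_G(A', q) = \mathcal{L}_G(A, q)$

(iv) $\forall i \in \{0, 1, \ldots, n\}$, $i \neq k$, $\forall q \in \mathcal{N}_{A'}(i)$, $\mathcal{L}_D(A', q) = \mathcal{L}_D(A, q)$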
Fig. 3. FA minimization from a biclique covering of the bipartite graph associated with
level 2



Proof. (i) and (ii) can be directly deduced from the definition of $A'$.

(iii) is verified for all levels smaller than $k$, and we have (iii) for all levels
greater than $k + 1$ if and only if we have (iii) for level $k + 1$.

(iv) is verified for all levels greater than $k$, and we have (iv) for all levels
smaller than $k - 1$ if and only if we have (iv) for level $k - 1$.

To show (iii) for the states of level $k + 1$ and (iv) for the states of level $k - 1$,
we show that there is a path $(q_{k-1}, q_k, q_{k+1})$ with $q_i \in \mathcal{N}_A(i)$ labeled $m$ in $A$
if and only if there is a path from $q_{k-1}$ to $q_{k+1}$ labeled $m$ in $A'$.

$\Rightarrow$ By definition of $R_k$, there is $(V, W) \in R_k$ such that $q_k \in V$ and
$(m(k+1), q_{k+1}) \in W$. We then have, by definition of $A'$, $V \in \mathcal{N}_{A'}(k)$ with
$(q_{k-1}, m(k)) \in \gamma^G_{A'}(V)$ and $(m(k+1), q_{k+1}) \in \gamma^D_{A'}(V)$.

$\Leftarrow$ Let $(q_{k-1}, V, q_{k+1})$ with $q_i \in \mathcal{N}_{A'}(i)$ be a path of $A'$ labeled $m$. By
definition of $A'$, there is some $q_k \in V$ such that $(q_{k-1}, m(k)) \in \gamma^G_A(q_k)$, and
there is a pair $(V, W) \in R_k$ such that $(m(k+1), q_{k+1}) \in W$. As $(V, W)$ is a
biclique of $B_k$, $(m(k+1), q_{k+1}) \in \gamma^D_A(q_k)$.



Proposition 2. Let $A$ be a FA recognizing $L \subseteq \Sigma^n$ and $R_k$ a biclique covering
of the bipartite graph $B_k$ of $A$. Then $A/R_k$ recognizes $L$.

Moreover, under some simple constraints on $R_k$, this transformation preserves
the property of being deterministic or unambiguous.
Proposition 3. Let $A$ be a DFA recognizing $L \subseteq \Sigma^n$. If $R_k$ is a biclique
covering of $B_k$ such that $\forall (V_i, W_i), (V_j, W_j) \in R_k$ $(i \neq j)$ we have
$V_i \cap V_j = \emptyset$, then $A/R_k$ is a DFA.

Proposition 4. Let $A$ be a UFA recognizing $L \subseteq \Sigma^n$. If $R_k$ is a biclique
decomposition of $B_k$ then $A/R_k$ is a UFA.

Proof. Let $A/R_k = A' = (Q', \Sigma, \delta', I, F)$. If $A'$ is ambiguous then there must be
$V_1, V_2 \in Q'$ such that $\mathcal{L}_G(A', V_1) \cap \mathcal{L}_G(A', V_2) \neq \emptyset$ and
$\mathcal{L}_D(A', V_1) \cap \mathcal{L}_D(A', V_2) \neq \emptyset$. From Lemma 2 we have
$V_1, V_2 \in \mathcal{N}_{A'}(k)$. Let
$m \in (\mathcal{L}_G(A', V_1).\mathcal{L}_D(A', V_1)) \cap (\mathcal{L}_G(A', V_2).\mathcal{L}_D(A', V_2))$
and let $q_k \in \mathcal{N}_A(k)$ be the state of level $k$ of $A$ on the only path labeled $m$ in
$A$. From (i) and (ii) of Lemma 2 we can deduce that $q_k \in V_1 \cap V_2$. Let
$m = m_1 d m_2$ with $m_1$ of length $k$; by hypothesis on $m$, we have in $A'$ two paths

$I \xrightarrow{m_1} V_1 \xrightarrow{d} s_1 \xrightarrow{m_2} F$ and $I \xrightarrow{m_1} V_2 \xrightarrow{d} s_2 \xrightarrow{m_2} F$,

where the states $s_1$ and $s_2$ belong to level $k + 1$. As $R_k$ is a biclique
decomposition, $s_1 \neq s_2$. Moreover, by Lemma 2 the right and left languages of
$s_1$ and $s_2$ are the same in $A$ and $A'$. We have thus constructed in $A$ two distinct
paths $I \xrightarrow{m_1 d} s_1 \xrightarrow{m_2} F$ and $I \xrightarrow{m_1 d} s_2 \xrightarrow{m_2} F$, which contradicts the
hypothesis that $A$ is a UFA.

5 FA Minimization Heuristics as a Sequence of Level
Minimizations

We propose to minimize a FA by applying a finite sequence of our level minimiza- 
tion step. We first define efficient heuristics for the biclique covering problem and 
next the order in which the levels are minimized. 



5.1 Biclique Covering Heuristics Inspired by Nerode Equivalence 

There are many possible heuristics for the minimum biclique covering problem,
which can in fact be formulated as a graph coloring problem. The heuristic we
propose is a generalization of Nerode's well-known equivalence relation. Two
states $q$ and $q'$ are Nerode equivalent ($q \equiv q'$) if and only if
$\mathcal{L}_D(A, q) = \mathcal{L}_D(A, q')$. For our particular automata, two states belonging to
different levels cannot be Nerode equivalent. Therefore Moore's classical
recursive definition of this relation can be stated as follows: if the Nerode
equivalence relation is equality for all pairs of states belonging to any level
greater than $k$, then for all $q_1, q_2 \in \mathcal{N}_A(k)$ we have
$q_1 \equiv q_2 \iff \gamma^D_A(q_1) = \gamma^D_A(q_2)$.

Let $A$ be a FA recognizing $L \subseteq \Sigma^n$ and $q_0$, $q_1$ two different states of this FA
such that $\gamma^D_A(q_0) = \gamma^D_A(q_1)$. We call equality reduction the operation
consisting in modifying $A$ into a new FA $A'$ by removing the state $q_0$ and updating
the transition function for $q_1$ in such a way that
$\gamma^G_{A'}(q_1) = \gamma^G_A(q_1) \cup \gamma^G_A(q_0)$.

It is then obvious that a MDFA can be computed from a given DFA A by 
applying successively all equality reduction operations on A from level n — 1 
down to level 1. 

We next show that applying all the equality reduction operations on level k can 
be achieved by computing a particular biclique covering of the bipartite graph 
associated to this level. 
Proposition 5. Let $A$ be a FA recognizing $L \subseteq \Sigma^n$. Let $T$ be the function
which computes from any given bipartite graph $B = (X, Y, E)$ the set of bicliques
$\{(twins(x), N(x)) \mid x \in X\}$, where $N(x)$ denotes the neighborhood of $x$ in $B$ and
$twins(x)$ denotes the set of vertices having the same neighborhood as $x$.
Then for any level $k$ of $A$, $A/T(B_k)$ is isomorphic to the FA obtained by applying
all equality reduction operations on states of level $k$.

We propose a generalization of the equality reduction operation and a function
that computes a biclique covering from a given set of such reductions.
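The function $T$ of Proposition 5 groups the vertices of $X$ by neighborhood. A minimal sketch, assuming the bipartite graph is given by its left vertex set and an edge set of (x, y) pairs (the function name and example are illustrative):

```python
def twins_cover(xs, edges):
    """T(B): one biclique (twins(x), N(x)) per distinct neighbourhood."""
    neighbourhood = {x: frozenset(y for (u, y) in edges if u == x) for x in xs}
    groups = {}
    for x, n in neighbourhood.items():
        groups.setdefault(n, set()).add(x)
    return [(twins, set(n)) for n, twins in groups.items()]

# Hypothetical level with three states; q1 and q2 have the same successors.
xs = {"q1", "q2", "q3"}
edges = {("q1", ("c", "F")), ("q2", ("c", "F")), ("q3", ("d", "F"))}
cover = twins_cover(xs, edges)
```

Since each vertex belongs to exactly one group, the left sides of the resulting bicliques are pairwise disjoint, which is the condition required by Proposition 3.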

Definition 3. Given a FA $A$, the couple $(q_0, \{q_1, \ldots, q_k\})$ is a union reduction if
$q_0 \notin \{q_1, \ldots, q_k\}$ and $\gamma^D_A(q_0) = \gamma^D_A(q_1) \cup \gamma^D_A(q_2) \cup \ldots \cup \gamma^D_A(q_k)$.
It is a disjoint union reduction if and only if
$\gamma^D_A(q_i) \cap \gamma^D_A(q_j) = \emptyset$ for all distinct $q_i, q_j \in \{q_1, \ldots, q_k\}$. Applying this
reduction to $A$ consists in modifying $A$ by removing state $q_0$ and updating the
transition function for states $q_1, q_2, \ldots, q_k$ in such a way that
$\gamma^G_{A'}(q_i) = \gamma^G_A(q_i) \cup \gamma^G_A(q_0)$, $\forall q_i \in \{q_1, q_2, \ldots, q_k\}$.
$k$ denotes the degree of the reduction.
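Disjoint union reductions of degree 2 can be found by exhaustive search over pairs of states. The sketch below assumes a level is given as a dictionary from each state to its set of outgoing (letter, target) pairs; the names and the example are illustrative, and no attempt is made at efficiency.

```python
from itertools import combinations

def disjoint_reductions_degree2(right):
    """List (q0, {q1, q2}) where right[q0] is the disjoint union of
    right[q1] and right[q2].

    right maps each state of a level to its outgoing (letter, target) pairs.
    """
    found = []
    for q0 in right:
        others = [q for q in right if q != q0]
        for q1, q2 in combinations(others, 2):
            if not (right[q1] & right[q2]) and right[q1] | right[q2] == right[q0]:
                found.append((q0, {q1, q2}))
    return found

# Hypothetical level: s can be replaced by t and u together.
right = {"s": {("a", "F"), ("b", "F")},
         "t": {("a", "F")},
         "u": {("b", "F")}}
reductions = disjoint_reductions_degree2(right)
```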

Figure 5 gives an example of a disjoint union reduction operation of degree
2 on the FA of the previous figure (an MDFA which was not reducible with the
equality reduction operation).




Fig. 5. Union Reduction operation on a MDFA 



Note 1. For the sake of simplicity, we assume in the rest of this paper that there
is no equality reduction operation on the considered FA levels.

To compute a biclique covering of $B_k$ from a set of reductions, we use a
function $Rec$ defined by:
Definition 4. Let $A$ be a FA recognizing $L \subseteq \Sigma^n$ and
$E = \{(q_1, S_1), (q_2, S_2), \ldots, (q_n, S_n)\}$ a set of union reduction operations on the
states of level $k$ of $A$ such that the $q_i$ are all distinct. Let $G(E)$ be the directed
graph with vertex set $\mathcal{N}_A(k)$ whose edge set is
$\{(q_i, q_j) \mid \exists (q_i, S_i) \in E, q_j \in S_i\}$. $\forall q \in \mathcal{N}_A(k)$ let
$Pred[q] = \{q\} \cup \{q' \in \mathcal{N}_A(k) \mid \exists$ a directed path in $G(E)$ from $q'$ to $q\}$. Then
$Rec(E) = \{(Pred[q], \gamma^D_A(q)) \mid q \in \mathcal{N}_A(k) \setminus \{q_1, \ldots, q_n\}\}$.
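Definition 4 can be sketched as follows, assuming the reductions are given as a dictionary from each reduced state $q_i$ to its set $S_i$ and the level as a dictionary of outgoing (letter, target) sets. These encodings and the example are assumptions for illustration.

```python
def rec(level_states, right, reductions):
    """Rec(E) for reductions given as a dict q_i -> S_i.

    The edges of G(E) go from each reduced state q_i to the members of S_i.
    """
    def reachable_from(q):
        # States reachable from q by a non-empty directed path in G(E).
        seen, stack = set(), [q]
        while stack:
            for nxt in reductions.get(stack.pop(), ()):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen

    cover = []
    for q in sorted(level_states):
        if q in reductions:
            continue  # reduced states disappear from the level
        pred = {q} | {p for p in level_states if q in reachable_from(p)}
        cover.append((pred, set(right[q])))
    return cover

# Hypothetical level with one disjoint reduction (s, {t, u}).
right = {"s": {("a", "F"), ("b", "F")},
         "t": {("a", "F")},
         "u": {("b", "F")}}
cover = rec({"s", "t", "u"}, right, {"s": {"t", "u"}})
```

The level had three states and one reduction, so the cover has 3 - 1 = 2 bicliques, as stated in Proposition 6.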

We then have the following Proposition: 

Proposition 6. Given a FA $A$ recognizing $L \subseteq \Sigma^n$ and $E$ a set of reductions on
the states of $N \subseteq \mathcal{N}_A(k)$, $Rec(E)$ is a biclique covering of $B_k$ by
$|\mathcal{N}_A(k)| - |N|$ bicliques. It is a biclique decomposition if all the reductions
in $E$ are disjoint.

Proof. • $Rec(E)$ is a biclique covering of $B_k$.

Let $E = \{(q_1, S_1), (q_2, S_2), \ldots, (q_n, S_n)\}$ ($N = \{q_1, q_2, \ldots, q_n\}$). Let
$C = (Pred[q], \gamma^D_A(q))$ be a pair of $Rec(E)$. If $Pred[q] = \{q\}$ then $C$ is clearly a
biclique of $B_k$. Otherwise, by definition of $Pred$, $\forall q' \in Pred[q]$, we have
$\gamma^D_A(q) \subseteq \gamma^D_A(q')$ and $C$ is also a biclique of $B_k$. Let us show that each edge
$e = (q, (d, q'))$ of $E_k$ is covered by a biclique of $Rec(E)$. If $q \notin N$, the biclique
$(Pred[q], \gamma^D_A(q))$ of $Rec(E)$ covers $e$. Otherwise, let $P = \{p_1, p_2, \ldots, p_l\}$ be the
set of sinks $p$ of $G(E)$ for which there is a path from $q$ to $p$. It is then obvious
from the definition of $G(E)$ that $\gamma^D_A(q) = \bigcup_{p \in P} \gamma^D_A(p)$ and, as
$R = \{(Pred[p], \gamma^D_A(p)) \mid p \in P\} \subseteq Rec(E)$ and as $\forall p \in P$, $q \in Pred[p]$, there is a
biclique of $R$ covering $e$.

• If $\forall (q_i, S_i) \in E$, $(q_i, S_i)$ is a disjoint reduction then $Rec(E)$ is a biclique
decomposition of $B_k$.

Let $K_0 = (Pred[q_0], \gamma^D_A(q_0))$ and $K_1 = (Pred[q_1], \gamma^D_A(q_1)) \in Rec(E)$. If $K_0$ and
$K_1$ cover a same edge then $Pred[q_0] \cap Pred[q_1] \neq \emptyset$ and
$\gamma^D_A(q_0) \cap \gamma^D_A(q_1) \neq \emptyset$. Let us show that this leads to a contradiction. Let us
assume $Pred[q_0] \cap Pred[q_1] \neq \emptyset$ and let $q \in Pred[q_0] \cap Pred[q_1]$ be one of the
closest states to $q_0$ and $q_1$. Let $(q, S)$ be the reduction on $q$. Let
$q'_0 \in Pred[q_0] \cap S$ and $q'_1 \in Pred[q_1] \cap S$. We have:

- $\gamma^D_A(q_0) \subseteq \gamma^D_A(q'_0)$ because $q'_0 \in Pred[q_0]$

- $\gamma^D_A(q_1) \subseteq \gamma^D_A(q'_1)$ because $q'_1 \in Pred[q_1]$

- $\gamma^D_A(q'_0) \cap \gamma^D_A(q'_1) = \emptyset$ because $(q, S)$ is a disjoint reduction and $q'_0 \neq q'_1$.

We then have $\gamma^D_A(q_0) \cap \gamma^D_A(q_1) = \emptyset$.

Finally, level minimization consists in computing a maximum set of reduction
operations $E$, computing the associated biclique covering $Rec(E)$ as in
Definition 4 and then computing the reduced FA $A/Rec(E)$. The next section
provides a way to apply level reductions in order to compute a small FA.

5.2 Order of Level Minimization 

For our languages, it follows from Proposition 5 that the minimization of a
deterministic automaton can be done by computing a sequence of level
minimizations. Moreover, Moore's recursive definition of Nerode's equivalence
leads directly to an efficient order for these level minimization steps: from
right (greatest level) to left (smallest level). Keeping this order for our
generalization leads to automata without reduction operations:

Assume that there is no reduction operation on any state of level greater
than $k$. Assume that $E$ is a maximum set of union reduction operations on states
of level $k$ of a FA $A$. It then follows from the maximality of $E$ and from the
definitions that there is no union reduction operation on the states of level $k$
of $A/Rec(E)$.

On the one hand, it is clear that our algorithm can be very efficient. For
example, it can be used to compute the small FA in Fig. 1 from the exponentially
bigger equivalent MDFA. But, on the other hand, it can fail to reduce some big
automata. Consider the FA obtained from the MDFA in Fig. 1 by reversing
its transitions (we will call it the reverse automaton). This FA clearly has no
reduction operation, and our minimization heuristic fails to reduce it although
an exponentially smaller equivalent FA exists, namely the reverse of the small
FA.

Now, let us call right reduction operations the reductions defined in Section 5.1
and define the left reduction operations. The couple $(q_0, \{q_1, \ldots, q_k\})$ is a left
union reduction on a state of $A$ if $q_0, q_1, \ldots, q_k$ are different states of $A$ such
that $\gamma^G_A(q_0) = \gamma^G_A(q_1) \cup \gamma^G_A(q_2) \cup \ldots \cup \gamma^G_A(q_k)$. Let $A$ be the reversed
automaton of the MDFA in Fig. 1 and let $A'$ denote the reversed automaton of the
smallest FA in this figure. $A$ is reducible with respect to left reduction
operations and there is a sequence of such reductions which computes the
exponentially smaller FA $A'$. But, because of the symmetry of the left and right
reduction operations, $A'$ can be obtained by reversing the FA computed by our
right-reduction-based algorithm on the reverse of $A$.

As an efficient minimization algorithm with respect to left and right
reductions should at least guarantee that the resulting FA has no left or right
reduction operations, we propose to apply our heuristic to $A$ and to its reverse.
Nevertheless, a new right reduction operation can appear after a left reduction
operation and vice versa. Thus, in order to compute a FA without right or left
reduction operations, the final algorithm proposed in the next section performs
more than one minimization step on $A$ and on its reverse, in fact as many as
necessary.
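The symmetry between left and right reductions can be illustrated by reversing the transitions: the incoming sets of an automaton are the outgoing sets of its reverse. This is a small sketch under an assumed triple encoding; note that, unlike the paper's $\gamma^G$, the pairs below are written (letter, state) on both sides so that they can be compared directly.

```python
def reverse(transitions, start, final):
    """Reverse every transition and swap the start and final states."""
    return {(q, d, p) for (p, d, q) in transitions}, final, start

def right_sets(transitions):
    """Outgoing (letter, target) pairs of each state."""
    out = {}
    for p, d, q in transitions:
        out.setdefault(p, set()).add((d, q))
    return out

def left_sets(transitions):
    """Incoming (letter, source) pairs of each state."""
    out = {}
    for p, d, q in transitions:
        out.setdefault(q, set()).add((d, p))
    return out

# Hypothetical FA for {ac, bc}.
transitions = {("I", "a", "q1"), ("I", "b", "q2"),
               ("q1", "c", "F"), ("q2", "c", "F")}
reversed_transitions, new_start, new_final = reverse(transitions, "I", "F")
```

A left reduction on an automaton is therefore a right reduction on its reverse, which is why a single right-reduction routine suffices.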



5.3 Final Algorithm 



Our FA minimization algorithm requires a function computing a maximum set
of reduction operations. The choice of the reductions used to minimize a level
is very important because it determines which reduction operations will be
available on the next level to minimize. From this point of view, it is a greedy
algorithm: when choosing a set of reductions, we do not care what happens on
other levels after applying it on a given level. There may be several union
reductions for a given state and, for computational efficiency, we have to
restrict the type of reductions (maximum degree fixed or not, disjoint or not).
In fact, we have tried several possibilities and experimentally chose this one:
computation of a maximum set of disjoint reduction operations of degree 2. This
is in fact the smallest generalization of the equality reduction operations
which can be used for DFA minimization. Note the pre-minimization of the level
(line 1), which removes equality reduction operations before the computation of
the reduction set (cf. Note 1).
Algorithm 1: MinimizeFA

Data: A FA $A = (Q, \Sigma, \delta, I, F)$ recognizing $L \subseteq \Sigma^n$

    N ← |Q|;
    repeat 2 times
        foreach level k from n − 1 to 1 do
1           A ← A/T(B_k);
            E ← maximum set of disjoint reductions of degree 2;
            A ← A/Rec(E);
        A ← reverse of A;
    if |Q| ≠ N then MinimizeFA(A)



Proposition 7. Given a FA $A$ recognizing $L \subseteq \Sigma^n$, MinimizeFA($A$) is
equivalent to $A$, contains no left or right disjoint reduction of degree 2 or
less, and is a UFA if $A$ is unambiguous.



6 Conclusion 

We have run some experiments on a French dictionary and on pseudo-random
languages. The French dictionary was turned into an Ln language by padding each
word at the end with a special character *. The random Ln languages were
generated with the same fixed alphabet and language size and with variable
$n$. These results were obtained on a 350 MHz Pentium running Linux with 64 MB of
RAM; times are given in seconds.

On the one hand, these results are satisfying: UFA are 25% smaller in number
of states than the equivalent MDFA for the dictionary and, for random languages,
the gain in number of states ranges from 50% down to 0 as the ratio of the number
of words to the language size decreases. But on the other hand, if we consider
the only real parameter that reflects FA size, i.e. the sum of the number of
states and of the number of transitions, we must be much more reserved. Indeed,
although the UFA can be smaller than the MDFA (up to 22% smaller), there are
some languages for which it is larger, in the worst case (for random languages
of length 5) nearly 3 times larger. In fact, this can be explained by the choice
of the biclique coverings: a reduction operation always decreases the number of
states, but the number of transitions removed can be smaller than the number of
transitions created. It is clear that our general minimization scheme based on
biclique coverings is interesting, in particular because it enables us to
compute unambiguous FA. But it is also clear that the computation of the
biclique coverings must take into account the number of transitions that will
be created.

It seems that the number of states is always used to measure the size of
automata. These experiments show that it would be better to also take
into account the number of transitions by defining the size of an automaton
as the sum of its number of states and number of transitions. This is clearly
important from a practical point of view, but perhaps also from a theoretical
one. We think that the following example illustrates this well.

Table 1. French dictionary; |Ln| = 58233, n = 26, |Σ| = 35

          time    #states   #st+#tr
  MDFA    7.54    29819     88487
  UFA     45.71   23346     84992

Table 2. Random languages of variable length; |Ln| = 50000, |Σ| = 10

              MDFA                        UFA
   n    #states  #st+#tr  time     #states  #st+#tr  time
   5      2133    18354   0.26       1121    51993   14
   6     11022    62637   0.78       9355    68507   240
   7     16423    78442   1.2       11069    72136   56
   8     22204    94164   1.6       13552    77103   12
   9     32017   114028   1.9       23853    97705   5.1
  10     58585   167169   2.3       56773   163546   4.6
  11    100147   250294   2.6       99839   249678   5.3
  12    148307   346613   2.8      148273   346545   6.1
  13    198073   446145   3.1      198069   446137   7
  14    248076   546151   3.3      248075   546150   7.7

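Let $\Sigma_1 = \{1, 2, \ldots, 2^k - 1\}$, $\Sigma_2 = \{1, 2, \ldots, k\}$ and let
$f : \Sigma_1 \to 2^{\Sigma_2} \setminus \emptyset$ be a bijection between $\Sigma_1$ and $2^{\Sigma_2} \setminus \emptyset$. We then define
$L_k \subseteq \Sigma_1 \times \Sigma_2$ by $ij \in L_k \iff j \in f(i)$. Let $A$ be the FA of states
$Q = \{I, F\} \cup \{q_1, q_2, \ldots, q_{2^k - 1}\}$ and transitions: $\forall i \in \Sigma_1$, $\delta(I, i) = q_i$ and
$\forall i \in \Sigma_1$, $\forall j \in f(i)$, $\delta(q_i, j) = \{F\}$. $A$ is the MDFA recognizing $L_k$. Let $A'$
be the FA of states $Q' = \{I, F\} \cup \{q_1, q_2, \ldots, q_k\}$ and transitions:
$\forall i \in \Sigma_1$, $\delta'(I, i) = \{q_j \mid j \in f(i)\}$ and $\forall i \in \Sigma_2$, $\delta'(q_i, i) = \{F\}$.
$A'$ is clearly a UFA recognizing $L_k$.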





Fig. 6. A UFA and an MDFA recognizing $L_3$
It is then obvious that we have presented another example of a UFA
exponentially smaller than the equivalent MDFA: $|Q'| = k + 2$ as opposed
to $|Q| = 2^k + 1$. But taking a look at the number of states plus the number
of transitions (denoted by RepSize) leads to another conclusion:
$RepSize(A) = (2^k + 1) + (2^k - 1 + X)$ and $RepSize(A') = (k + 2) + (k + X)$ with
$X = \sum_{i \in \Sigma_1} |f(i)| = k 2^{k-1}$, and then $RepSize(A) \in O(RepSize(A'))$.
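The two sizes can be checked numerically by counting states and transitions straight from the definitions of $A$ and $A'$. The sketch below is illustrative only; the function name `rep_sizes` is an assumption, and the bijection $f$ is taken to be an arbitrary enumeration of the nonempty subsets of $\Sigma_2$, since only the subset sizes matter for the count.

```python
from itertools import combinations

def rep_sizes(k):
    """Return (RepSize(A), RepSize(A')) for the language L_k."""
    sigma2 = range(1, k + 1)
    # f: one non-empty subset of sigma2 per letter of sigma1.
    subsets = [set(c) for r in range(1, k + 1)
               for c in combinations(sigma2, r)]
    x = sum(len(s) for s in subsets)  # X = k * 2^(k-1)
    # MDFA A: states I, F and one q_i per subset;
    # one transition I -> q_i per subset, then |f(i)| transitions q_i -> F.
    size_a = (2 + len(subsets)) + (len(subsets) + x)
    # UFA A': states I, F and one q_j per letter of sigma2;
    # |f(i)| transitions out of I per letter i, then one q_j -> F each.
    size_a_prime = (2 + k) + (x + k)
    return size_a, size_a_prime
```

For k = 3 this gives 28 against 20: although $A'$ has exponentially fewer states, both sizes are dominated by the same term $k 2^{k-1}$.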
Abstract. We present here the system SEA, which integrates manipulations
over boolean and multiplicity automata. The system also provides
self-development facilities.



1 Introduction 

The SEA symbolic environment arose from the dream of having within the same
system different theories over automata facing each other, so that similarities
and differences could be at hand. At the present stage, two theories offering
minimization processes (however with some major differences, see below) were
chosen: the boolean case and the case of multiplicities in a field (here Q to
begin with). Many operations are common, such as simple rational laws: union,
concatenation, Kleene's closure (and their counterparts: sum, Cauchy product,
star and external product) and extended rational laws: shuffle, Hadamard
(intersection) and infiltration product.

Many software packages for automata have been developed. Let us cite AUTOMATE,
AMoRE and Grail. The package implementing boolean theory in
SEA is the new version of the package AUTOMAP. The advantages of the
symbolic computation system Maple are numerous. The first one is that it provides
a simple and easy-to-use environment. The tight correspondence between the
fundamental data types of Maple (lists, sets, tables) and the algebraic
structures studied (automata, matrices), as well as the fact that the results of
a computation can be used as input data for a subsequent computation, have been
good criteria of choice. The conversion of a mathematical expression into a Maple
procedure is also quite natural; this feature implies a faster and more secure
implementation than with a low-level language (such as C or C++).

Many questions then arise and some major ones remain unsolved, such as finding
a deep link between the minimizations despite the fact that many features
seem to be similar: minimal automaton of left quotients, isomorphism of minimal
models, intertwining; the meaning of the star in the other boolean case (K =
Z/2Z); uniqueness of minimal models within certain classes of nondeterministic
automata. The structure of this short presentation is the following: the first
part (Section 2) presents the theoretical background, describes the functions
available in the environment and provides an example of computation; the second
part is devoted to a technical description of the environment with a “how to”
subsection; finally, we conclude by describing some future steps and stating
unsolved problems.

* Corresponding author: caron@dir.univ-rouen.fr

O. Boldt and H. Jürgensen (Eds.): WIA’99, LNCS 2214, pp. 13 ff., 2001.
(c) Springer-Verlag Berlin Heidelberg 2001

2 Theoretical Aspects of Boolean Automata and 
Automata with Multiplicities 

2.1 Basic Definitions 

Let $A$ be a finite alphabet, and $K$ a semiring. A formal series is a mapping 
$S$ from $A^*$ into $K$, usually denoted by 

    $S = \sum_{w \in A^*} (S|w)\, w$ 

(where $(S|w) := S(w) \in K$ is the coefficient of $w$ in $S$). The set of 
series, $K\langle\langle A \rangle\rangle$, is naturally endowed with sum and 
external product. Concatenation is extended to series by the well-defined 
formula 

    $R.S = \sum_{w \in A^*} \Big( \sum_{uv = w} (R|u)(S|v) \Big) w$ 

and will be called convolution or Cauchy product in the case of 
multiplicities. Remark that, if $(S|1) = 0$, then for every $w \in A^*$ the set 
$\{ n \mid (S^n|w) \neq 0 \}$ is finite and, in this case, 

    $S^* := \sum_{w \in A^*} \Big( \sum_{n \geq 0} (S^n|w) \Big) w$ 

is well defined. Notice that in case $K = \{0, 1\}$ (which occurs in two cases: 
$\mathbb{B}$ and $\mathbb{Z}/2\mathbb{Z}$), all functions are characteristic 
and series can be identified with subsets of $A^*$, the languages. In the 
boolean ($K = \mathbb{B}$) case (resp. "with multiplicities") the simple 
rational operations are union (resp. sum and external product), concatenation 
(resp. Cauchy product), and Kleene's closure (resp. star). A language (resp. 
a series) is said to be rational iff it can be obtained from the letters by a 
finite number of combinations of the rational laws. The formula thus obtained 
is called a rational expression of $S$. 
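
The Cauchy product formula above can be illustrated on finitely supported series. The following is a minimal Python sketch, not part of the packages described in this paper; it assumes a series is stored as a dictionary from words to coefficients, and the name `cauchy_product` is hypothetical.

```python
from collections import defaultdict


def cauchy_product(r, s):
    """Cauchy product of two finitely supported series.

    Each series is a dict {word: coefficient}; this encoding is an
    assumption made for the illustration only.
    """
    result = defaultdict(int)
    for u, ru in r.items():
        for v, sv in s.items():
            # Every factorization w = uv contributes (R|u)(S|v) to (R.S|w).
            result[u + v] += ru * sv
    # Drop words whose coefficient cancelled out.
    return {w: c for w, c in result.items() if c != 0}


R = {"": 1, "a": 2}        # R = 1 + 2a
S = {"a": 3, "aa": 1}      # S = 3a + aa
product = cauchy_product(R, S)
```

Here the word aa has two factorizations with nonzero contribution (1.aa and a.a), so its coefficient is 1*1 + 2*3 = 7.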

A (boolean) automaton over an alphabet $A$ is usually defined as a 4-tuple 
$(Q, I, F, \delta)$ where $Q$ is a finite set of states, $I \subseteq Q$ the 
set of initial states, $F \subseteq Q$ the set of final states, and 
$\delta \subseteq Q \times A \times Q$ the set of edges. This definition can be 
extended to the case of multiplicities in any semiring. A $K$-automaton is 
then a 4-tuple $(Q, I, F, \delta)$ on a semiring $K$, and $I$, $F$ and 
$\delta$ are rather viewed as mappings $I : Q \to K$, $F : Q \to K$, and 
$\delta : Q \times A \times Q \to K$. In fact, a $K$-automaton is an automaton 
with input weights, output weights, and a weight associated to each edge. 
Here, for each word $w = a_1 \cdots a_p \in A^*$, the coefficient $(S|w)$ is 
the sum of the weights of the paths labeled by $a_1 \cdots a_p$, the weight of 
a path being the product of its input, output and edge weights. This is 
equivalent (with $n = |Q|$) to the data of a triplet $(\lambda, \mu, \gamma)$ 
where $\lambda \in K^{1 \times n}$ codes the input states, 
$\gamma \in K^{n \times 1}$ codes the output states, and 
$\mu : A \to K^{n \times n}$ gives the transition matrices, $n$ being called 
the dimension of the representation. A series $S$ is recognizable if there 
exists an automaton $\mathcal{A} = (\lambda, \mu, \gamma)$ such that 

    $|\mathcal{A}| := \sum_{w \in A^*} \lambda\, \mu(w)\, \gamma \; w$ 

(the behavior of $\mathcal{A}$) is $S$. Schützenberger's classical theorem 
asserts that rational series are exactly the recognizable series. This is an 
extension of Kleene's classical result. The left quotient of a language $L_2$ 
with respect to a language $L_1$ is defined by 
$L_1^{-1} L_2 = \{ v \mid u \cdot v \in L_2 \text{ and } u \in L_1 \}$ or 
equivalently $(L_1^{-1} L_2 | w) := (L_2 | L_1 w)$. The previous formula, well 
defined in the boolean case, makes sense in general for $L_1$ with finite 
support. Thus, in any case, for a series (or a language) $S$, the left 
quotients $w^{-1}S$ are defined and a variant of Schützenberger's theorem 
states: "$S$ is rational iff it belongs to a finitely generated $K$-module of 
series stable by the operators $w^{-1}$". In case the semiring is finite (in 
particular for $K = \mathbb{B}$), this amounts to saying that 
$\{ w^{-1}S \}_{w \in A^*}$ is finite. In case $K$ is a field, it is equivalent 
to the fact that the dimension of the $K$-space generated by the family 
$(w^{-1}S)_{w \in A^*}$ is finite. Thus, in both cases, one defines the minimal 
automaton as the automaton with states $\{ w^{-1}S \}_{w \in A^*}$ (resp. a 
basis of $\mathrm{Span}(\{ w^{-1}S \}_{w \in A^*})$ in the case $K$ is a field) 
and appropriate transitions. Finally, we should add that the minimal 
automaton computed in this way is complete and deterministic for the 
booleans, whereas, for multiplicities in a field, it is minimal in the number 
of states among the automata recognizing the series. 



2.2 Operations over Automata 

Simple rational operations. Operations over automata are deduced from 
those on series (i.e. on languages in the boolean case). In both cases we have 
the classical rational operations, which are sum (union), Cauchy product 
(concatenation), star (Kleene's closure) and external product. For details on 
the theory of series the reader is referred to [2]. 



Extended rational operations. There are other operations that preserve 
rationality. The operations on automata corresponding to the shuffle 
($\sqcup\!\sqcup$), Hadamard ($\odot$) and infiltration ($\uparrow$) products 
have been implemented, the behavior of the resulting automata being 
respectively the shuffle, Hadamard and infiltration products of the behaviors 
of the two automata. Let us remark that these three products can be defined 
from a doubly continuous family¹ of laws preserving rationality and 
satisfying the following recursion, for $a, b \in A$ and $u, v \in A^*$: 

    $1 \odot_{\epsilon,q} 1 = 1,$ 
    $a \odot_{\epsilon,q} 1 = 1 \odot_{\epsilon,q} a = \epsilon a,$ 
    $au \odot_{\epsilon,q} bv = \epsilon \big( a(u \odot_{\epsilon,q} bv) + b(au \odot_{\epsilon,q} v) \big) + q\, \delta_{a,b}\, a(u \odot_{\epsilon,q} v),$ 

with $\delta_{a,b}$ the Kronecker symbol. With these notations, shuffle is 
$\odot_{1,0}$, Hadamard is $\odot_{0,1}$ and infiltration is $\odot_{1,1}$. 
The intersection of two $\mathbb{B}$-automata is then a specialization of the 
Hadamard product. 
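
The recursion can be checked on small words. The Python sketch below is an illustration only (it is not how the packages compute these products on automata); the function name `eq_product` is hypothetical, and the one-sided case $au \odot_{\epsilon,q} 1 = \epsilon\, a (u \odot_{\epsilon,q} 1)$ for words longer than one letter is an assumption that extends the stated base case and agrees with the three classical products.

```python
from collections import defaultdict


def eq_product(u, v, eps, q):
    """Return the product of the words u and v for the law with
    parameters (eps, q), as a dict {word: coefficient}."""
    if not u and not v:
        return {"": 1}
    result = defaultdict(int)
    if not u or not v:
        # One-sided case (assumed extension of the base case):
        # aw . 1 = eps * a (w . 1), and symmetrically.
        w = u or v
        for tail, c in eq_product(w[1:], "", eps, q).items():
            result[w[0] + tail] += eps * c
    else:
        a, b = u[0], v[0]
        for tail, c in eq_product(u[1:], v, eps, q).items():
            result[a + tail] += eps * c
        for tail, c in eq_product(u, v[1:], eps, q).items():
            result[b + tail] += eps * c
        if a == b:  # Kronecker symbol
            for tail, c in eq_product(u[1:], v[1:], eps, q).items():
                result[a + tail] += q * c
    return {w: c for w, c in result.items() if c != 0}


shuffle = eq_product("ab", "ab", 1, 0)
hadamard = eq_product("ab", "ab", 0, 1)
infiltration = eq_product("a", "a", 1, 1)
```

With (1, 0) the call returns the shuffle 2 abab + 4 aabb, with (0, 1) it returns ab (the Hadamard product of a word with itself), and with (1, 1) the infiltration of a with a is 2 aa + a.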

Let us define the mirror (a unary operation, like the star operation), in 
order to state the proposition used for implementing the algorithm of the 
right quotient computation in the boolean case. Notice that the notation 
$S_1^{-1} S_2$ does not have the same meaning as $L_1^{-1} L_2$, because it 
does not correspond in series to the same operations (quotient for languages, 
Cauchy product of the inverse of $S_1$ and $S_2$ for series). The mirror of a 
$K$-automaton is exactly the automaton in which the initial states become 
final, the final states become initial, and in which we reverse every edge 
(that is, the automaton defined by the triplet 
$(\gamma^{t}, \mu^{t}, \lambda^{t})$, where $t$ denotes transposition). The 
mirror of a word $w$ over an alphabet $A$ can be recursively defined as 
follows: $\tilde{\epsilon} = \epsilon$, and 
$\widetilde{v \cdot a} = a \cdot \tilde{v}$ for $a \in A$ and $v \in A^*$. The 
mirror of $L \subseteq A^*$ is $\tilde{L} = \{ \tilde{w} \mid w \in L \}$. The 
mirror of $S \in K\langle\langle A \rangle\rangle$ is 
$\tilde{S} = \sum_{w \in A^*} (S|w)\, \tilde{w}$. The right quotient of the 
language $L_1$ with respect to the language $L_2$, denoted by 
$L_1 L_2^{-1}$, is dually defined by 
$L_1 L_2^{-1} = \{ u \mid u \cdot v \in L_1 \text{ and } v \in L_2 \}$, so one 
has 

    $L_1 L_2^{-1} = \widetilde{\tilde{L}_2^{-1} \tilde{L}_1}$. 

¹ This family arises as a natural set of laws defined by duality. 


2.3 Minimization 

In both cases, automata (deterministic in the boolean case, arbitrary in the 
case of a field) with the minimum number of states recognizing a given series 
are isomorphic to the minimal automaton previously defined, and it is well 
known that we have a minimization algorithm at our disposal (see the 
references for details). 



3 Presentation of the Packages 

Several formats are used to represent automata: one for the multiplicity case 
and three for the boolean one. Functions to translate an automaton from one 
format into another are provided. Two cases have to be distinguished (from 
multiplicity to boolean and conversely, and from boolean to boolean). 



3.1 Description of the Formats 

With the previous notations, the automaton $(Q, I, F, \delta)$ of Figure 1 
will be represented by either the standard format 




Fig. 1. B-automaton 
    [1,{5},table([
        1 = table([
            a = {2,4},
            b = {3}
        ]),
        2 = table([
            a = {2,4},
            b = {3}
        ]),
        3 = table([
            a = {2,4},
            b = {3}
        ]),
        4 = table([
            b = {5}
        ])
    ])]

which consists of: 



- a single initial state (1), 

- a set of final states ({5}), 

- a transition table, which is a table indexed on states and on letters 
  (table[1][a] = {2,4} represents the two edges (1,a,2) and (1,a,4)), 

or the list format 

    [1,{5},table([
        b = [{3},{3},{3},{5},{}],
        a = [{2,4},{2,4},{2,4},{},{}]
    ])]



which differs from the standard format only in the representation of the 
transition table. It is intended to provide an easy-to-use interface for users 
to enter their own automata. The transition table is indexed by letters, each 
entry being a list of sets: the i-th set contains the states reached from 
state i by that letter. The expression 

    t[a]=[{2,4},{2,4},{2,4},{},{}]

represents the edges (1,a,2), (1,a,4), (2,a,2), (2,a,4), (3,a,2), (3,a,4). 

The deterministic format is the same as the list format, except that each set 
of the transition table is replaced by the only state reached. 

These three formats represent only boolean automata. 
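
To make the relation between the list format and the standard format concrete, here is a minimal Python sketch of the conversion, followed by a membership test. It is an analogue written for illustration, not the Maple code of the ABOOL package; the names `list_to_standard` and `recognizes`, and the encoding of Maple tables as Python dictionaries, are assumptions.

```python
def list_to_standard(automaton):
    """Convert (initial, finals, {letter: [targets of state 1, 2, ...]})
    into (initial, finals, {state: {letter: set of targets}})."""
    initial, finals, by_letter = automaton
    by_state = {}
    for letter, targets_per_state in by_letter.items():
        # States are numbered from 1, as in the formats described above.
        for state, targets in enumerate(targets_per_state, start=1):
            if targets:
                by_state.setdefault(state, {})[letter] = set(targets)
    return initial, finals, by_state


def recognizes(automaton, word):
    """Nondeterministic run of a standard-format automaton on a word."""
    initial, finals, by_state = automaton
    current = {initial}
    for letter in word:
        current = {t for s in current
                   for t in by_state.get(s, {}).get(letter, set())}
    return bool(current & finals)


# The automaton of Figure 1 in list format.
list_format = (1, {5}, {"b": [{3}, {3}, {3}, {5}, set()],
                        "a": [{2, 4}, {2, 4}, {2, 4}, set(), set()]})
standard = list_to_standard(list_format)
```

States without outgoing edges (state 5 here) simply do not appear in the resulting table, which matches the standard-format listing given earlier.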

A K-automaton has a matrix representation. The first element is the vector 
of entry costs for each state of the automaton, the last one the vector of 
exit costs, and the second one is the list of the transition matrices 
corresponding to each letter. Hence, the list 







P. Andary et al. 



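$$\left[\ (2/3 \;\; 0 \;\; 0 \;\; 0),\ 
\left[ 
\begin{pmatrix} 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{pmatrix},\ 
\begin{pmatrix} 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 3 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{pmatrix} 
\right],\ 
\begin{pmatrix} 0 \\ 1 \\ 0 \\ 0 \end{pmatrix}\ \right]$$

corresponds to the automaton given in Figure 2. 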




Fig. 2. K-automaton 
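
The coefficient of a word in the series recognized by such a matrix representation is the product of the entry vector, the matrices of its letters, and the exit vector. The Python sketch below illustrates this on the four-state example; it is not AMULT code, the name `coefficient` is hypothetical, the matrix entries are transcribed here as an assumption, and assigning the first matrix to the letter a and the second to b is also an assumption.

```python
from fractions import Fraction


def coefficient(lam, mu, gamma, word):
    """Compute lam * mu(word) * gamma with exact arithmetic."""
    row = list(lam)
    for letter in word:
        m = mu[letter]
        # Row vector times matrix.
        row = [sum(row[i] * m[i][j] for i in range(len(row)))
               for j in range(len(m[0]))]
    return sum(x * g for x, g in zip(row, gamma))


lam = [Fraction(2, 3), 0, 0, 0]            # entry costs
mu = {
    "a": [[0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 0], [0, 0, 0, 0]],
    "b": [[0, 0, 0, 0], [0, 0, 0, 1], [0, 3, 0, 0], [0, 1, 0, 0]],
}
gamma = [0, 1, 0, 0]                       # exit costs

c_a = coefficient(lam, mu, gamma, "a")
c_aab = coefficient(lam, mu, gamma, "aab")
```

Under these assumptions the word a has coefficient 2/3, and aab has coefficient 2/3 * 1 * 1 * 3 = 2, the edge of weight 3 being the only non-unit edge weight.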



For the implementation of the AMULT package, we use the tables (as in the 
ABOOL package) which are dynamic structures more efficient than the matrix 
structures. 



3.2 Operations on Automata 



Both packages provide a set of operations on automata. These operations 
can be classified into three categories. The first one gathers operations on 
languages (series): concatenation (Cauchy product), union (sum), Kleene's 
closure (star), ... Table 1 gathers operations on automata which do not 
change the series that is recognized (determinization, minimization, 
reduction, ...). The other operations are described in Table 2. All functions 
corresponding to these operations have a shortcut (&C, &U, &S, ...). The last 
category is the set of manipulating operations. It gathers format 
transformation operations, which are described in Table 3 (deterministic 
format to standard one, standard format to lists format, multiplicity 
automaton to boolean one, ...), as well as miscellaneous operations (states 
of an automaton, equivalence of two automata, ...). 
Table 1. Transformation on automata 

Boolean   Multiplicity
function  function      Parameter        Description
--------  ------------  ---------------  ------------------------------------
Deter     -             aut : automaton  returns the determinized automaton
                                         of aut
Mini      mini          aut : automaton  returns the minimal automaton of aut
Trim      -             aut : automaton  returns the reduced automaton of aut
Std       -             aut : automaton  returns an automaton with a single
                                         initial state without incoming edges
-         reduce        aut : automaton  returns an automaton in a right
                                         reduced form


Table 2. Operations on automata 

                                     Boolean        Multiplicity
Description                          function       function      Shortcuts  Parameters
-----------------------------------  -------------  ------------  ---------  ----------------------
returns the complementary automaton  Comp           -                        aut : automaton
w.r.t. the alphabet U                                                        U : set of letters
returns the concatenation (Cauchy    Concat         Cauchy        &C, &c     aut1, aut2 : automata
product) of the automata aut1 and
aut2
returns the automaton of aut*        Star           star          &S, &s     aut : automaton
returns the automaton intersection   Inter          Hadamard      &I, &i     aut1, aut2 : automata
(Hadamard product) of aut1 and aut2
returns the automaton shuffle        Shuffle        shuffle                  aut1, aut2 : automata
product of aut1 and aut2
returns the automaton mirror of aut  Mirror         -                        aut : automaton
returns the automaton difference of  Minus          -             &M         aut1, aut2 : automata
aut1 and aut2
returns the automaton of the n       Power          -             &p         aut : automaton
times concatenation                                                          n : integer
returns the automaton of the left    LeftQuotient   -             &L         aut1, aut2 : automata
quotient of aut2 w.r.t. aut1
returns the automaton of the right   RightQuotient  -             &R         aut1, aut2 : automata
quotient of aut1 w.r.t. aut2
returns the union (sum) automaton    Union          sum           &U, &u     aut1, aut2 : automata
of aut1 and aut2
returns the automaton external       -              ext                      aut : automaton
product of aut by α                                                          α : scalar
returns the automaton of the series  -              inv                      aut : automaton of
$S^{-1}$ s.t. $S \cdot S^{-1} = 1$                                           the series S
returns the automaton infiltration   -              ip                       aut1, aut2 : automata
product of aut1 and aut2

Table 3. Miscellaneous functions 

Boolean       Multiplicity
function      function      Parameter               Description
------------  ------------  ----------------------  --------------------------------------
Alphabet      -             aut : automaton         returns the alphabet of the automaton
                                                    aut
Aut           -             a : letter              builds the automaton of the letter a
StToDf        -             aut : automaton         builds a deterministic format
                                                    transition table from a standard one
DfToSt        -             aut : automaton         builds a standard format transition
                                                    table from a deterministic one
StToLf        -             aut : automaton         builds a lists format transition table
                                                    from a standard one
LfToSt        -             aut : automaton         builds a standard format transition
                                                    table from a lists one
AreEquiv      -             aut1, aut2 : automata   tests if the languages recognized by
                                                    aut1 and aut2 are equal
States        -             aut : automaton         returns the set of states of aut
RecognizedBy  -             w : word                tests whether the word w is recognized
                            aut : automaton         by the automaton aut
BoolToMult    -             aut : automaton in a    gives a matrix format for the AMULT
                            standard format         package from a standard format
-             MultToBool    aut : automaton in a    gives a standard format from a matrix
                            matrix format           format




3.3 Example of Session 

After loading the two packages ABOOL and AMULT, we present some possible 
computations. Consider first the two regular expressions 
$E_1 = (ab)^* \sqcup\!\sqcup (ab)^*$ and $E_2 = (a(ab)^*b)^*$, where the set of 
coefficients is the boolean semiring. Using the function AreEquiv of the 
package ABOOL, we find that they represent the same language. We also 
compute the minimal automaton with the function Mini. However, if the set of 
coefficients is the field $\mathbb{Q}$, we can observe that multiplicities 
naturally appear in the linear representations r1 and r2 associated 
respectively with $E_1$ and $E_2$. Using the function mini of the package 
AMULT, we find that the resulting minimal linear representations are not 
isomorphic, although their supports are equal (their regular expressions are 
respectively $E_1$ and $E_2$). 

Some final remarks can be made on this example. Since the multiplicities are 
positive, we may replace them by 1, and we then obtain an automaton in the 
boolean context. In general, however, this conversion is not immediate when 
negative numbers appear. We must also be careful: the minimality of a linear 
representation with positive multiplicities does not imply the minimality 
(here in the number of states, as the automaton may be nondeterministic) of 
the corresponding automaton in the booleans. For example, if we take the 
$\mathbb{N}$-rational (or $\mathbb{Q}$-rational, of course) series 
$S = \sum_{w \in A^*} F_{|w|+1}\, w$, where $F_n$ is the $n$-th Fibonacci 
number, the minimal linear representation is 

    ((1 0), ...),

the transition matrices being the same for each letter of the alphabet. But 
the minimal boolean automaton of the support ($A^*$) has only one state. 
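
The gap between the two notions of minimality can be seen numerically. The Python sketch below uses one possible two-dimensional representation whose coefficients are Fibonacci numbers; the matrix and the exit vector are assumptions chosen for the illustration (only the entry vector (1 0) is taken from the text), and the name `fib_coefficient` is hypothetical.

```python
def fib_coefficient(length):
    """Coefficient of any word of the given length, for the assumed
    representation lam = (1 0), mu(x) = [[1, 1], [1, 0]] for every
    letter x, gamma = (1 0) transposed."""
    m = [[1, 1], [1, 0]]
    row = [1, 0]                      # entry vector
    for _ in range(length):
        row = [row[0] * m[0][0] + row[1] * m[1][0],
               row[0] * m[0][1] + row[1] * m[1][1]]
    return row[0] * 1 + row[1] * 0    # product with the exit vector


values = [fib_coefficient(n) for n in range(7)]
```

All these coefficients are nonzero, so the support is the whole of A*, which a one-state boolean automaton recognizes, while a one-dimensional representation can only produce coefficients of the form c·r^n and therefore cannot produce the Fibonacci numbers.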





Maple V session (b-session.ms): 

    > with(ABOOL);
        [Alphabet, AreEquiv, Aut, Comp, Concat, Deter, DfToSt, Inter,
         LeftQuotient, LfToSt, Mini, Minus, Mirror, Power, RecognizedBy,
         RightQuotient, Shuffle, StToDf, StToLf, Star, States, Std, Trim,
         Union]

    > A:=Aut(a): B:=Aut(b):

    > E2:=&S(A &C (&S(A &C B)) &C B);
        E2 := [1, {1, 5}, table([
            3 = table([
                b = {4}
            ]),
            4 = table([
                a = {3},
                b = {5}
            ]),
            5 = table([
                a = {2}
            ]),
            1 = table([
                a = {2}
            ]),
            2 = table([
                a = {3},
                b = {5}
            ])
        ])]

    > AreEquiv(Deter(E1),Deter(E2));
        true

    > StToDf(Mini(E2));
        [1,{1},table([
            a = [2,3,4,4],
            b = [4,1,2,4]
        ])]




The ABOOL package is first loaded. Then we compute the automata of the 
two letters a and b, and we build the automata of $E_1$ and $E_2$. The 
function AreEquiv allows us to verify that the two deterministic automata 
representing the languages $L(E_1)$ and $L(E_2)$ recognize the same language. 
Maple V session (k-session.ms): 

    > with(AMULT);
        [Cauchy, Hadamard, inv, ip, mini, reduce, ..., shuffle, star, ...]

    > au:=[matrix(1,3,[1,0,0]),[matrix(3,3,[0,1,0,0,0,0,0,1,0]),matrix(3,3,[0,0,0,0,0,1,0,0,0])],matrix(3,1,[1,0,1])];

                           [0 1 0]   [0 0 0]     [1]
        au := [ [1 0 0], [ [0 0 0] , [0 0 1] ] , [0] ]
                           [0 1 0]   [0 0 0]     [1]

    > r1:=au &w au:

    > r2:=&s ( a &c (au &c b));
        [0 0 0 0 0 0 0 1], ...

    > m1:=mini(r1);

                           [0 2 0]   [0 0 0]     [1]
        m1 := [ [1 0 0], [ [0 0 2] , [1 0 0] ] , [0] ]
                           [0 0 0]   [0 1 0]     [0]

    > m2:=mini(r2);

                           [0 1 0]   [0 0 0]     [1]
        m2 := [ [1 0 0], [ [0 0 1] , [1 0 0] ] , [0] ]
                           [0 0 0]   [0 1 0]     [0]

    > AreEquiv(Trim(Mini(E2)),Std(MultToBool(m2)));
        true








Next, the AMULT package is loaded. We compute the linear representations 
r1 and r2. The construction of automata from regular expressions is 
implemented identically in the two packages; the only difference is that the 
external product appears in AMULT. After we obtain the minimal linear 
representations with the function mini, the conversion from multiplicities to 
booleans (multiplicities are replaced with 1) is made by the function 
MultToBool. 



4 The SEA Environment 

SEA is a tool for symbolic computation on automata under the Maple system 
on Unix platforms. The two major functionalities of SEA are the structuring 
and grouping of automata packages for Maple, and the systematic generation 
of readlib-defined packages. 

When you develop a set of Maple functions, you eventually group and structure 
them into packages, because this yields code that is easier to maintain and 
reuse. 

Maple enables the developer to create simple packages, which are just tables 
in which the indices are function names and the entries are function bodies. 
Saving this table into a ".m" file in the correct location permits anyone to 
load the package later, during another Maple session, using the with command. 

When the functions are cumbersome and numerous, you may want to load 
them only when needed, rather than systematically at the beginning of the 
session. This can be achieved via the readlib Maple command in the following 
way. For every index named f in the package table, remove its entry and 
replace it with the following protected form: 

    'readlib('f', <file>)':

where <file> is the absolute name of the file containing a definition for your 
f function (and possibly other things). Loading the package table is then very 
quick, because it does nothing but define where your functions are located. 
At the first call, your function will be loaded, and the remember table of the 
readlib command prevents the system from loading it again. 
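
The load-on-first-call idea is independent of Maple. As a rough analogue only (this is not how readlib is implemented), the following Python sketch keeps a table of loaders and caches each function the first time it is requested; the class name `LazyPackage` and the counter `load_count` are hypothetical and exist only to make the behavior observable.

```python
class LazyPackage:
    """A package table whose entries are loaded on first access."""

    def __init__(self, loaders):
        # name -> zero-argument callable returning the actual function
        self._loaders = dict(loaders)
        # Plays the role of the remember table: already loaded functions.
        self._loaded = {}
        self.load_count = 0

    def __getitem__(self, name):
        if name not in self._loaded:
            self._loaded[name] = self._loaders[name]()
            self.load_count += 1
        return self._loaded[name]


def load_double():
    # Stands for reading the definition of the function from its file.
    return lambda x: 2 * x


package = LazyPackage({"double": load_double})
first = package["double"](21)
second = package["double"](4)
```

Creating the table costs nothing beyond storing the loaders; the second call reuses the cached function, so the loader runs only once.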

4.1 How to Write Your Package 

First of all, you have to write down your functions. We distinguish between two 
kinds of functions: those that make up the public part (the interface) of the 
package, accessible to the client user, and those that are somehow private and 
thus theoretically inaccessible² to the client user. We call the first kind of 
functions external, and the second kind internal. Notice that you can further 
refine the second category by distinguishing the functions that are used by 
only one external function (satellite functions) from those that are used by 
many (tool functions). 

For the package to be successfully managed by SEA, you have to ensure that 
your source code satisfies the following requirements. 

1. There is only one external function (and possibly many satellite functions) 
   per source file (generally named according to the external function). 

   - The signature of the external function begins with the following regular 
     expression within its source file 

         ^[\t ]*`<pnam>_EXT/<fnam>`[\t ]*:=[\t ]*proc[\t ]*(.*$

     where <pnam> and <fnam> are the package name and the external function 
     name respectively. Notice that the parameter list can extend over 
     many lines. 

   - The signature of internal functions begins with the following regular 
     expression within the source file 

         ^[\t ]*`<pnam>/<fnam>`[\t ]*:=[\t ]*proc[\t ]*(.*$



² In fact there is no information hiding in Maple, so you can ignore the 
privacy of these functions; but they may be undocumented, meaning that you 
are not expected to use them. 
   - Every function call must be in Maple's long format: 

     • use <pnam>[<fnam>](<parlist>) to call the external function 
       <fnam> with the <parlist> list of parameters, lying in the <pnam> 
       package, 

     • use `<pnam>/<fnam>`(<parlist>) to call the internal function 
       <fnam> with the <parlist> list of parameters, lying in the <pnam> 
       package. 

2. There is one file per help function (which looks like 
   `help/text/<fnam>` := TEXT(...):), and one help function for every external 
   function and for the package itself. 

3. There is only one file (named declext) for all the external constants and 
   variables, which will appear in the package table. 

4. There is only one file (named declint) for all the local types and tool 
   functions, which will not appear in the package table. 
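
The signature requirements can be checked mechanically before handing a source file to SEA. The sketch below is written in Python rather than in SEA's own tooling; it assumes that the character class in the expressions above is "tab or space", it escapes the opening parenthesis as Python's `re` module requires, and the helper name `signature_pattern` as well as the sample lines are hypothetical.

```python
import re


def signature_pattern(pnam, fnam, external):
    """Build a pattern for the first line of a function definition.

    external=True expects the `<pnam>_EXT/<fnam>` naming, otherwise the
    internal `<pnam>/<fnam>` naming.
    """
    suffix = "_EXT" if external else ""
    name = re.escape(f"{pnam}{suffix}/{fnam}")
    return re.compile(rf"^[\t ]*`{name}`[\t ]*:=[\t ]*proc[\t ]*\(.*$")


external_ok = bool(signature_pattern("ABOOL", "Mini", True)
                   .match("`ABOOL_EXT/Mini` := proc(aut)"))
internal_ok = bool(signature_pattern("ABOOL", "helper", False)
                   .match("\t`ABOOL/helper`:=proc(x,"))
mismatch = bool(signature_pattern("ABOOL", "Mini", True)
                .match("`ABOOL/Mini` := proc(aut)"))
```

The second sample ends in the middle of the parameter list, which the pattern accepts because only the beginning of the signature is constrained.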

The Maple documentation tells us that we can write a `<pname>/init` function, 
which is automatically executed before any with command call, until the 
(global) variable `<pname>/Initialized` becomes true. You can use this trick 
to do some initial work before executing any of your package functions. 

Every file you provide will be precompiled "as is" into Maple internal format, 
so you can safely put comments into your source or help files. 



4.2 How to Use SEAER Package 

You are given an error package to manage errors from one place, providing a 
standard output format for all error messages. 

The package is named SEAER and it contains (at the present stage) a single 
function named SEAER[SeaErr]. 

Its standard calling sequence is SeaErr(name, errno), where name is the name 
of the client function (i.e. its procname) and errno is the identifier of the 
detected error. At the moment, there are only two public error identifiers: 

- SEAER[INP], meaning Invalid Number of Parameters, and 

- SEAER[IPT], meaning Invalid Parameter Type. 

Further error identifiers may be defined on request to the technical 
administrator³. 

³ We stress the fact that you can use SEA in two different manners. On the one 
hand, you can use SEA freely for your own sake: compile any kind of Maple 
package (even one not related to automata theory), use your own error 
function(s) or adapt the SEAER package to your convenience, modify any 
sources, and so forth. On the other hand, anyone who wants to add 
functionalities within SEA has to make a proposal to the administrator. In any 
case, distribution under the name SEA is only allowed for the original, 
unmodified SEA package. 



SEA: A Symbolic Environment for Automata Theory 






    $SEAPATH
    +- CONTRIB
    |  +- ABOOL
    |  +- AMULT
    |  +- YRPKG
    |     +- DOC
    |     |  +- YRPKG-usr.ps
    |     |  +- YRPKG-ref.ps
    |     |  +- readme.1st
    |     +- SRC
    |     |  +- ENVT
    |     |  |  +- declext
    |     |  |  +- declint
    |     |  +- FCTN
    |     |  |  +- FirstFun
    |     |  |  +- LastFun
    |     |  +- HELP
    |     |     +- FirstFun
    |     |     +- LastFun
    |     |     +- YRPKG
    |     +- XMPL
    |        +- FirstXmpl.m
    |        +- SecndXmpl.m
    +- SEAER
    +- NEWS
    +- SEASTEM



4.3 How to Incorporate Your Package into SEA 

When you are ready with your source code, you can put your package (say 
YRPKG) into SEA. 

First you have to set the SEAPATH environment variable to the directory 
where SEA resides (for example /home/lname/Maple/SEA). Then you create a 
tree structure in the $SEAPATH/CONTRIB directory containing all your source 
files (Maple code and documentation). Now you have to install your package, 
that is, you have to precompile it into Maple internal format, build the 
package files, and place them into your local library. This is quite simple, 
because you just have to run 

    $SEAPATH/SEASTEM/UTL/inst [-v VerboseDir] MapleLibDir [PkgName]*

The -v option means verbose, so you can see what inst really does for you in 
the VerboseDir directory. The MapleLibDir parameter is the target directory 
for all your compiled files (for example /home/lname/Maple/Lib/SEA). Finally 
you give the name(s) of the package(s) you want to (re)install; giving none 
means all of them. 

That is all. Now you can run Maple and load any package you want, or 
simply enter with(SEA): at the prompt. 
5 Conclusion 

The joint development of the two packages yielded a two-way interaction 
between theory and implementation. This work has generated many theoretical 
questions closely related to the two theories involved. 

Moreover, we have seen that extended rational laws are better understood 
(and implemented) as dual laws deriving from suitable coproducts. The 
compatibility of congruences with these coproducts has been completely solved 
in certain cases; the general problem seems to be manifold and remains 
difficult. In particular, among the next steps, we plan to implement finite 
semirings with some new features coming from p-combinatorics. 
Abstract. In this paper, we develop theoretical, as well as practical, 
tools for the synthesis and the verification of processes that contain n 
timers. Such tools are as well suited to numerical calculations as to 
symbolic ones, thus allowing for parametric analysis. 

The results we have obtained rely on a simple and efficient representation 
of the states of an automaton that recognizes the behaviors of the pro- 
cess. This representation is based on a mechanical structure which helps 
us encode the states in a compact manner and leads to simple algorithms. 

Keywords: automata, real-time systems, verification, synthesis. 



1 Introduction 

We are interested in reactive systems whose behavior depends on both the events 
of a process and the start and timeout events of n timers. The timers are started 
individually or in groups with the occurrence of an event. The timeout event 
of a given timer can, in turn, influence the behavior of the process. This kind 
of system can be found in all applications where a given delay between two 
actions or events must be met or exceeded. Transportation or telecommunication 
networks are good examples of such systems. 

Consider, for example, a simple process such as the one modeled in Fig. 1. 



Fig. 1. A simple process 



O. Boldt and H. Jürgensen (Eds.): WIA'99, LNCS 2214, pp. 27 ff., 2001. 
© Springer-Verlag Berlin Heidelberg 2001 
This automaton describes the logical sequences of events of the process, but 
does not capture any timing constraint between them. These must be stated as 
additional properties such as: 

(1) The events Begin-Cooking and Stop-Cooking must be separated by a 
delay of at least $d_1$ minutes. 

(2) The events Begin-Cooking and Serve must be separated by a delay of at 
most $d_2$ minutes. 

One way to specify this type of constraint is to associate with each constraint 
a symbolic timer. 

Definition 1. A symbolic timer T is given by a real positive constant d, and by 
the following automaton: 

[Figure: two-state automaton; event S leads from the initial (inactive) state 
to the active state, and event T leads back.] 

In its initial state, the timer is inactive. Event S starts the timer. After a 
delay d, event T occurs and the timer returns to its initial state. In each 
state, null loops labeled - allow for the synchronization with external 
events. Thus a timer models the language of alternating events S and T, 
possibly interspersed with other events from external processes. 
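
The untimed part of Definition 1 is a two-state automaton, which can be sketched directly. The Python function below is an illustration under two assumptions: the null event is written "-", and every prefix of an alternating sequence counts as a valid behavior (the definition does not single out final states). The name `timer_accepts` is hypothetical.

```python
def timer_accepts(events):
    """Check that a sequence over {'S', 'T', '-'} alternates S and T,
    starting from the inactive state; '-' is the null loop."""
    active = False
    for event in events:
        if event == "-":
            continue                  # null loop, allowed in both states
        if event == "S" and not active:
            active = True             # start the timer
        elif event == "T" and active:
            active = False            # timeout, back to the initial state
        else:
            return False              # S while running, T while inactive,
                                      # or an unknown symbol
    return True


ok = timer_accepts("S-T-S")
restarted_while_running = timer_accepts("SS")
```

The delay d plays no role here; it only constrains where the events can be placed in time.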

When two timers function in parallel, the possible events can be described 
by vectors of the form: 

    $\begin{bmatrix} e_1 \\ e_2 \end{bmatrix}$

where $e_1$ is an event of the first timer, and $e_2$ is an event of the second 
timer. Using this notation, the sequence of timer events that would be 
associated with a correct sequence of events of Fig. 1 would be: 



          Begin-            Stop-
          Cooking           Cooking   Serve

    T1       S        T        -        -        -
    T2       S        -        -        -        T
With this simple example, we see that such a sequence is possible if and only 
if $d_1 < d_2$, where $d_1$ is the delay of timer 1 and $d_2$ is the delay of 
timer 2. 

Note that this type of condition on the delays depends only on the sequence 
of timer events, and not on the events of the original process. We will thus set 
aside, for the time being, all external processes and focus our attention on the 
possible event sequences of a system of n timers. Let us assume that the delays 
of the n timers are given by 

    $D = (d_1, \ldots, d_n)$.

We will consider the following problems: 

(1) Verification: Given the delays $D = (d_1, \ldots, d_n)$, what are the 
possible event sequences? Is the set of all such sequences a regular language? 

(2) Synthesis: What constraints must be placed on the vector D in order for 
a given set of behaviors to be possible? Or impossible? 

In order to study these problems, several models have been put forth, the 
best known of which is undoubtedly that of timed automata. This theory 
notably shows that behaviors are recognizable by finite automata in the case of 
rational delays. Moreover, a proof of recognizability in the case n = 2, for 
all real delays, is available in the literature. In the following paragraphs, 
we develop techniques that allow for the verification of systems with 
parametric delays. 

Problem number (2) is somewhat more difficult. A systematic process 
must be developed which will generate timing constraints ensuring that a given 
set of behaviors is possible or impossible. We will introduce algorithms that 
provide simple decision-making procedures for the enumeration of constraints. 



2 Event Sequences on n Timers 

A system of n timers with delays $D = (d_1, d_2, \ldots, d_n)$ is composed of 
timers that function in parallel. 

An event of the global system of timers is a vector whose components belong 
to the set {S, T, -}. We are interested in the possible behaviors of systems of 
n timers and therefore in the set of event sequences. For example, the sequence 
abcde: 

[Array of the event vectors a, b, c, d, e, with one row per timer $T_1, \ldots, T_5$] 

is defined on a system of 5 timers. 
Such a sequence is said to be possible if we can place its events on a time 
line, like the one in Fig. 2, such that the following conditions hold: 

(1) The events are separate. 

(2) The distance between the event in which timer i is started and the event in 
which it times out equals $d_i$. 

(3) If timer i is still running after the sequence of events, the distance between 
the event that last started it and the reading point is strictly less than $d_i$. 
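
Conditions (1) to (3) can be checked directly for a concrete placement of the events. The Python sketch below is an illustration only: the encoding of events as tuples over 'S', 'T', '-', the requirement that the reading point is not before the last event, the rejection of restarts of a running timer, and the name `is_possible_time_line` are all assumptions, and a small tolerance is used for the equality in condition (2).

```python
def is_possible_time_line(sequence, times, reading, delays, tol=1e-9):
    """Check conditions (1)-(3) for events placed at the given times.

    sequence: list of tuples, one component per timer ('S', 'T' or '-')
    times:    position of each event on the time line
    reading:  position of the reading point
    delays:   the vector D = (d_1, ..., d_n)
    """
    # (1) The events are separate (strictly increasing positions).
    if any(t2 <= t1 for t1, t2 in zip(times, times[1:])):
        return False
    if times and reading < times[-1]:
        return False
    started = [None] * len(delays)    # start time of each running timer
    for event, t in zip(sequence, times):
        for i, symbol in enumerate(event):
            if symbol == "S":
                if started[i] is not None:
                    return False      # timer already running
                started[i] = t
            elif symbol == "T":
                # (2) Timeout exactly d_i after the start.
                if started[i] is None or abs((t - started[i]) - delays[i]) > tol:
                    return False
                started[i] = None
    # (3) Running timers have a reading strictly below their delay.
    return all(s is None or reading - s < d
               for s, d in zip(started, delays))


cooking = [("S", "S"), ("T", "-"), ("-", "-"), ("-", "-"), ("-", "T")]
good = is_possible_time_line(cooking, [0, 2, 3, 4, 5], 5, (2, 5))
bad = is_possible_time_line(cooking, [0, 2, 3, 4, 5], 5, (5, 2))
```

The first call places the two-timer cooking sequence with delays (2, 5) and succeeds; swapping the delays makes the same placement violate condition (2).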



Fig. 2. A time line associated with the sequence abcde. 

For a given time line, the readings of the timers are given by the vector 

    $R = (r_1, r_2, \ldots, r_n)$

where $r_i$ is the distance between the event that last started timer i and 
the reading point. The value of $r_i$ for inactive timers is irrelevant and is 
set to 0. 

If we know the value of R, we can determine the possible next events of the 
system: in fact, we can predict which timer will time out next. 
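
For a single vector of readings, the prediction amounts to finding the running timers with the smallest remaining time $d_i - r_i$. A minimal Python sketch follows; since a reading of 0 does not distinguish an inactive timer from one that has just been started, the set of running timers is passed explicitly, which is an assumption of this illustration, as is the name `next_timeouts`.

```python
def next_timeouts(delays, readings, running):
    """Return (remaining time, set of timers that time out next).

    delays, readings: sequences indexed by timer (0-based here)
    running:          indices of the timers that are currently running
    """
    remaining = {i: delays[i] - readings[i] for i in running}
    if not remaining:
        return None, set()
    soonest = min(remaining.values())
    return soonest, {i for i, r in remaining.items() if r == soonest}


wait, timers = next_timeouts((4, 10, 7), (1, 8, 0), {0, 1})
```

Here timer 0 has 3 time units left and timer 1 has 2, so timer 1 times out next; timer 2 is inactive and is ignored.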

In general, there exists an infinite number of time lines associated with a 
given sequence y. Thus, if the set of readings that corresponds to the set of time 
lines is known, we can predict all possible next events of the system. 

Definition 2. Let y be a sequence of events on n timers. The state $S_y$ is the 
set of all possible readings after the sequence y. 

It is well known that a state $S_y$ is a convex polytope of dimension 
$k \leq n'$, where $n'$ is the number of timers that are still running after 
the sequence y. Representations of such states are often complex and difficult 
to manage. In the following sections, we will assign a mechanical 
representation to a state, allowing us to develop simple algorithms for 
computing states and transitions. 



3 A Mechanical Interpretation of a State 

The distances between events of a sequence are bound by certain constraints. 
For example the distance between an event that starts a timer and the event in 
which it times out is constant for all possible time lines. In general, we have: 

Definition 3. Let $e_1 e_2 \ldots$ be a possible event sequence. Two events of the
sequence are said to be linked if the distance that separates them is the same in
every time line.



Analysis of Reactive Systems with n Timers 






The relation "linked" is the equivalence relation generated by the pairs of events

$$(e_k, e_l)$$

where $e_k$ is an event that starts timer i and $e_l$ is the next event in which
timer i times out. In the example given earlier, the events {a, c, e} form an
equivalence class because of the pair (a, e), in which timer 1 is started and
times out, and the pair (a, c), in which timer 2 is started and times out.

The above definition allows us to introduce a representation of a time line
that captures the essential constraints on the positions of the events of a sequence
(Fig. 3).



[Figure: mechanical representation of the sequence abcde as vertical and
horizontal rods above the time line a b c d e.]

Fig. 3. A state.



In this structure, each event of a sequence defines a vertical rod possessing
eyelets and welded to one horizontal rod. Linked events are welded to the same
horizontal rod; thus, a horizontal rod represents an equivalence class of events.
The right end of a horizontal rod represents the next event of the rod, i.e. the 
moment when the next timer belonging to the rod will time out. 

Apart from its illustrative qualities, the representation of Fig. 3 allows us to
reason by analogy about the properties of a state. For example, it isn't hard
to convince ourselves of the following:

Observation 1 There is always a certain play between two horizontal rods.

Since the events are separated, two rods will be separated by the minimal 
distance between pairs of events of the two corresponding classes. 



Observation 2 By sliding the rods horizontally, we can obtain all possible read- 
ings for a given state. 

Each possible time line corresponds to a position of the mechanical structure,
and one time line can be obtained from any other by modifying the distance
between events that are not in the same class, that is, by sliding one rod with
respect to another.



Observation 3 There exists one and only one position of the structure in which 
each timer has a minimal reading: it is obtained by pushing all the rods to the 
right, against the vertical rod representing the reading point. This position is 
called the minimal position. 

Another way to obtain this minimal position is to rotate the structure clockwise
by 90° while holding the reading rod, and let gravity do its work.

The structure captures all the essential characteristics of a state. For the time
being, such a construct depends on the complete knowledge of an event sequence.
It is therefore of little use for the construction of an automaton recognizing the 
set of possible event sequences. However, in the following section, we will see 
how all the possible movements of this structure can be encoded by a small set 
of values. Given a state and a possible event, the next state of the automaton 
will result from simple mechanical operations on the rods. 



4 Description of a State 

A state must contain sufficient information to predict which events are possible, 
and what should be the next state after any possible event. In this section, we 
will show that the following numbers do indeed describe a state. 

Let $S_y$ be a state, that is, a set of possible readings for n timers after the
sequence y. Define, for any pair of active timers i and j, the numbers

$$M_{ij} = \max_{(r_1, r_2, \ldots, r_n) \in S_y} (r_i - r_j)$$

and, for each timer, its minimal reading in state $S_y$:

$$m_i = \min_{(r_1, r_2, \ldots, r_n) \in S_y} (r_i)$$

In the mechanical representation of a state, $M_{ij}$ is the maximal (oriented)
distance between the events that start timers i and j. In order to compute them,







we start with the minimal position described in Observation 3, on the left of the 
following diagram. We then push the rod containing timer k - called, for short, 
timer k - to the left until the reading rod is forced to move. 



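[Figure: left, the minimal position; right, the position reached after pushing
timer k to the left until the reading rod is forced to move.]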



Once in that position, illustrated in the right part of the diagram, the distance
between the start events of timer k and any other timer i is minimized, if timer i
was started before k, and maximized, if timer i was started after timer k. Thus,
$M_{ki} = r_k - r_i$ for the readings corresponding to that position.

Note that in this last position, since $r_i \le d_i$, we have that

$$r_k \le M_{ki} + d_i$$

but since we pushed timer k to its maximal left position, there must be at least
one timer for which $r_i = d_i$, thus we have:

Proposition 1. The maximal reading of timer k is given by:

$$M_k = \min_i \{M_{ki} + d_i\}$$

where i ranges over all active timers, including timer k.
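
As an illustration of Proposition 1, the following minimal Python sketch
computes the maximal reading of a timer from the numbers $M_{ki}$ and the delays
$d_i$. The data layout (a dictionary M keyed by pairs of timer indices, a
dictionary d keyed by timer index), the function name max_reading and the
numeric delays are assumptions made for this example only; they do not come
from the paper.

```python
def max_reading(k, M, d, active):
    """Maximal reading of timer k: min over active timers i of M[k, i] + d[i]."""
    return min(M[k, i] + d[i] for i in active)


# Two timers started in the same event: their readings are always equal,
# so every M[i, j] is 0. The delays are made-up numbers.
d = {1: 10, 2: 4}
active = [1, 2]
M = {(i, j): 0 for i in active for j in active}

print(max_reading(1, M, d, active))  # 4, i.e. min(d1, d2)
print(max_reading(2, M, d, active))  # 4
```

Both timers can read at most min(d1, d2), because the timer with the shorter
delay must time out first.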

The numbers $M_{ij}$ give other clues about the nature of the constraints in state
$S_y$. In the mechanical representation, two timers are linked if their start
events are on the same rod. More formally:

Definition 4. Two timers are linked in state $S_y$ if the difference in their
readings is constant.

We have immediately:

Proposition 2. Timers i and j are linked if and only if

$$M_{ij} = -M_{ji}.$$

Finally, we say that two timers are synchronized if they must time out to- 
gether. This can be formalized as: 

Definition 5. Timers i and j are synchronized if they are linked and if

$$d_i - m_i = d_j - m_j$$

With these definitions, it is possible to give an elementary test to detect 
which timers can time out in a state. We have: 
Theorem 1. Timer k can time out in state $S_y$ if and only if

$$d_k - M_{ki} < d_i$$

for all active timers i not synchronized with k.

Proof. If timer k can time out, then it was possible to push it completely to the
left from the minimal position:

[Figure: the minimal position and the position reached after pushing timer k
completely to the left.]

Clearly, $d_k - M_{ki} \le d_i$ by definition of the numbers $M_{ki}$. If
$d_k - M_{ki} = d_i$ and timer i is not synchronized with k, then timer i must
time out before k, since the timeout events of i and k must be separate and
timer k cannot time out before i does.

On the other hand, if $d_k - M_{ki} < d_i$, then it was possible to move timer k
all the way to the left, so timer k can time out in state $S_y$. ■
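
The tests of Proposition 2, Definition 5 and Theorem 1 can be sketched in a few
lines of Python. This is only an illustration under stated assumptions: the
condition of Theorem 1 is read as the strict inequality $d_k - M_{ki} < d_i$, and
the dictionary layout, the function names and the numeric delays are
hypothetical. The sample state is the one reached when timers 1 and 2 are
started together and timer 4 is started afterwards.

```python
def linked(i, j, M):
    """Proposition 2: the difference of the two readings is constant."""
    return M[i, j] == -M[j, i]


def synchronized(i, j, M, m, d):
    """Definition 5: linked timers that must time out together."""
    return linked(i, j, M) and d[i] - m[i] == d[j] - m[j]


def can_time_out(k, M, m, d, active):
    """Theorem 1, read with a strict inequality (an assumption of this sketch)."""
    return all(d[k] - M[k, i] < d[i]
               for i in active
               if not synchronized(i, k, M, m, d))


# Made-up delays; timers 1 and 2 started together, timer 4 started later.
d = {1: 10, 2: 4, 4: 8}
active = [1, 2, 4]
m = {1: 0, 2: 0, 4: 0}
M = {(i, j): 0 for i in active for j in active}
M[1, 4] = M[2, 4] = 4          # min(d1, d2)

print([k for k in active if can_time_out(k, M, m, d, active)])  # [2]
```

With these delays only timer 2 can time out: timer 1 is linked to timer 2 but
has the longer delay, and timer 4 was started later with a longer delay.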

Theorem 1 gives us the possible time out events in a state. In order to be able 
to predict which events are possible in a state, we must solve a last problem. Sup- 
pose that two timers can time out in a state, can they time out simultaneously? 
The following proposition answers this question: 

Proposition 3. If timers k and l can time out in a state $S_y$, then they can time
out simultaneously.

Proof. It is sufficient to show that there is at least one position where the
readings of timers k and l are respectively $d_k$ and $d_l$. Starting with the
minimal position, we first push timer k to the left: its reading will be $d_k$
since timer k can time out. Then we push timer l to the left until timer k starts
moving or until the reading of timer l is $d_l$. If the reading of timer l is
$d_l$, we are done. If not, we get

$$M_{lk} + d_k < d_l$$

since we reached the maximal (oriented) distance between timers k and l. Thus

$$d_k < d_l - M_{lk}$$

implying that timer l could not time out in the first place. ■

Finally, we can give the complete description of possible events after a se- 
quence y. We have: 

Theorem 2. An event e is possible in state $S_y$ if and only if:

a) The set of timers that are started in e is a subset of the inactive timers. 

b) The set of timers that time out in e is a subset of the timers that can time 
out, together with their synchronized timers. 

5 Computing the Next State

In this section, we will show how, given a possible event e in state $S_y$, we
can compute the new values $M'_{ij}$ and $m'_i$.

The first step is to compute the new values $m'_i$. These are obtained by pushing
the rods that time out in event e.

Proposition 4. Let N be the set of timers that are started in event e, and T
be the set of timers that time out in event e. Then for all active timers i,

$$m'_i = \begin{cases} m_i & \text{if } T \text{ is empty,} \\ \max_{k \in T} \{d_k - M_{ki}\} & \text{otherwise} \end{cases}$$

and for all $i \in N$

$$m'_i = 0.$$

Proof. From the minimal position, we push all the timers that time out up to
their limit. This is possible by Proposition 3.

If timer i moves when we push timer k, then $r'_i = d_k - M_{ki}$. If it does not
move, then $r'_i > d_k - M_{ki}$; thus, for at least one k, $m'_i = d_k - M_{ki}$.

For new timers that are started, it is immediate that $m'_i = 0$. ■

We next compute the values $M'_{ij}$ for timers already active in $S_y$. Suppose,
for example, that timers k and l time out in event e and consider the following
diagram obtained after pushing the timers that time out in event e:

[Figure: position of the rods after pushing timers k and l to the left, showing
timers i and j and the reading rod.]

In order to compute $M'_{ij}$ we must push timer i to the left. When we push
it, either timer j will move first, in which case $M'_{ij} = M_{ij}$, or timer i
will have reached its maximal position without touching timer j, in which case
$M'_{ij} = M_i - m'_j$. We thus have the following:

Proposition 5. If timers i and j are active in state $S_y$ then

$$M'_{ij} = \min\{M_{ij}, M_i - m'_j\}$$

When several timers time out and are started in event e, links will be created 
between events. The next operation is to fuse together the rods that just timed 
out and the new rods created. Indeed, a new timer will be linked to any timer 
that was linked to a timer that just timed out. These relations are captured by: 
Proposition 6. Let N be the set of timers that are started in event e, and T
be the set of timers that time out in event e. Then for $i, j \in N$, and all
active timers k:

$$M'_{ij} = M'_{ji} = 0$$

$$M'_{ik} = -m'_k$$

$$M'_{ki} = \begin{cases} M_k & \text{if } k \text{ is not linked to any timer in } T \\ m'_k & \text{otherwise} \end{cases}$$

Proof. Clearly, if timers i and j are started together, their readings will always
be equal, thus $M'_{ij} = M'_{ji} = 0$.

If timer i is started in event e, then its minimal distance from timer k will
be $m'_k$, thus $M'_{ik} = -m'_k$.

Finally, if timer k is linked to one of the timers that times out, it will be
linked to any timer that is started in e, thus $M'_{ki} = m'_k$.

Otherwise, the distance between timer k and timer i is the maximal value
of timer k in state $S_y$. ■
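
The update rules of Propositions 4 to 6 can be combined into a single
next-state function. The Python sketch below is one possible reading of these
propositions, not the authors' implementation: it assumes that the second term
of Proposition 5 is $M_i - m'_j$, with $M_i$ the maximal reading of Proposition 1
in the current state, and the data layout, the function name next_state and
the numeric delays are hypothetical.

```python
def next_state(active, M, m, d, started=(), timed_out=()):
    """Return (active', M', m') after an event, following Propositions 4 to 6."""
    N, T = set(started), set(timed_out)
    # Maximal readings in the current state (Proposition 1).
    Mmax = {k: min(M[k, i] + d[i] for i in active) for k in active}
    survivors = [k for k in active if k not in T]

    # Proposition 4: new minimal readings.
    m2 = {}
    for i in survivors:
        m2[i] = max(d[k] - M[k, i] for k in T) if T else m[i]
    for i in N:
        m2[i] = 0

    M2 = {}
    # Proposition 5: pairs of timers that were already active.
    for i in survivors:
        for j in survivors:
            M2[i, j] = min(M[i, j], Mmax[i] - m2[j])
    # Proposition 6: pairs involving a newly started timer.
    for i in N:
        for j in N:
            M2[i, j] = 0
        for k in survivors:
            M2[i, k] = -m2[k]
            linked_to_T = any(M[k, t] == -M[t, k] for t in T)
            M2[k, i] = m2[k] if linked_to_T else Mmax[k]
    return survivors + sorted(N), M2, m2


# Made-up delays. Events: start 1 and 2; start 4; timer 2 times out, start 3.
d = {1: 10, 2: 4, 3: 7, 4: 8}
events = [((1, 2), ()), ((4,), ()), ((3,), (2,))]
active, M, m = [], {}, {}
for started, timed_out in events:
    active, M, m = next_state(active, M, m, d, started, timed_out)

print(active)                      # [1, 4, 3]
print(m[1], M[1, 3], M[3, 1])      # 4 4 -4, i.e. d2, d2, -d2
```

After the third event, timer 1 has minimal reading d2 and is linked to the
newly started timer 3 at distance d2, as expected when timer 2 was started
together with timer 1.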



6 Example of Analysis 

As a simple example of constraint generation, let's consider the sequence:

abcde

Assuming the delays of the timers are given by the parameters
$(d_1, d_2, d_3, d_4, d_5)$, we can compute the successive values of $M_i$, $m_i$
and $M_{ij}$.



1. Event a starts timers 1 and 2. By Propositions 1 and 6, we have:

$M_{ij}$   1                  2                  3   4   5
1          0                  0                  -   -   -
2          0                  0                  -   -   -
3          -                  -                  -   -   -
4          -                  -                  -   -   -
5          -                  -                  -   -   -
$m_j$      0                  0                  -   -   -
$M_j$      $\min(d_1, d_2)$   $\min(d_1, d_2)$   -   -   -







2. Event b starts timer 4, which is possible since timer 4 is inactive. By
Proposition 6, and since no timer times out, $M_{1,4} = M_{2,4} = \min(d_1, d_2)$.

$M_{ij}$   1                  2                  3   4                       5
1          0                  0                  -   $\min(d_1, d_2)$        -
2          0                  0                  -   $\min(d_1, d_2)$        -
3          -                  -                  -   -                       -
4          0                  0                  -   0                       -
5          -                  -                  -   -                       -
$m_j$      0                  0                  -   0                       -
$M_j$      $\min(d_1, d_2)$   $\min(d_1, d_2)$   -   $\min(d_1, d_2, d_4)$   -



3. In event c, timer 2 times out and timer 3 is started. By Theorem 1, this event
is possible if timer 1 is not synchronized with timer 2 and if:

$$d_2 - M_{2,1} < d_1 \iff d_2 < d_1$$

$$d_2 - M_{2,4} < d_4 \iff d_2 - \min(d_1, d_2) < d_4$$

Given that $d_2 < d_1$, the second constraint asserts that $d_4 > 0$, which is
trivial. So we get a first constraint for the system, $d_2 < d_1$, and the new
state:

$M_{ij}$   1                      2   3                  4       5
1          0                      -   $d_2$              $d_2$   -
2          -                      -   -                  -       -
3          $-d_2$                 -   0                  0       -
4          $\min(0, d_4 - d_2)$   -   $\min(d_2, d_4)$   0       -
5          -                      -   -                  -       -
$m_j$      $d_2$                  -   0                  0       -
$M_j$      $M_1$                  -   $M_3$              $M_4$   -

where, by Proposition 1:

$$M_1 = \min(d_1, d_2 + d_3, d_2 + d_4)$$

$$M_3 = \min(d_1 - d_2, d_3, d_4)$$

$$M_4 = \min(\min(0, d_4 - d_2) + d_1, \min(d_2, d_4) + d_3, d_4)$$

4. Event d starts timer 2 again, leading to the state:

$M_{ij}$   1                      2       3                  4       5
1          0                      $M_1$   $d_2$              $d_2$   -
2          $-d_2$                 0       0                  0       -
3          $-d_2$                 $M_3$   0                  0       -
4          $\min(0, d_4 - d_2)$   $M_4$   $\min(d_2, d_4)$   0       -
5          -                      -       -                  -       -
$m_j$      $d_2$                  0       0                  0       -
$M_j$      $M_1$                  $M_2$   $M_3$              $M_4$   -

where $M_2 = \min(d_1 - d_2, d_2, d_3, d_4)$.



5. Finally, in event e, timer 1 must time out. This will generate the following
constraints:

$$d_1 - M_{1,2} < d_2 \iff d_1 - M_1 < d_2$$

$$d_1 - M_{1,3} < d_3 \iff d_1 - d_2 < d_3$$

$$d_1 - M_{1,4} < d_4 \iff d_1 - d_2 < d_4$$

Given the second and third constraints, $M_1 = d_1$, so the first constraint
is trivial, and we get the following constraints that make the sequence abcde
possible:

$$d_2 < d_1$$

$$d_1 - d_2 < d_3$$

$$d_1 - d_2 < d_4$$
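
The three constraints can be cross-checked by building a time line directly
from conditions (1) to (3) of a possible sequence. The Python sketch below
assumes the events described in steps 1 to 5 (a starts timers 1 and 2, b starts
timer 4, c times out timer 2 and starts timer 3, d restarts timer 2, e times
out timer 1), ignores timer 5, and places the reading point just after e. The
function names are hypothetical.

```python
from fractions import Fraction


def time_line_for_abcde(d1, d2, d3, d4):
    """Try to place events a..e on a time line; return the times or None."""
    d1, d2, d3, d4 = map(Fraction, (d1, d2, d3, d4))
    ta, tc, te = Fraction(0), d2, d1   # timer 2 runs a -> c, timer 1 runs a -> e
    # b lies strictly between a and c, and timer 4 (started in b) is still
    # running at e: te - tb < d4.
    lo_b, hi_b = max(ta, te - d4), tc
    # d lies strictly between c and e, and timer 2 (restarted in d) is still
    # running at e: te - td < d2.
    lo_d, hi_d = max(tc, te - d2), te
    # Timer 3 (started in c) must also still be running at e.
    if not (tc < te and te - tc < d3 and lo_b < hi_b and lo_d < hi_d):
        return None
    return {"a": ta, "b": (lo_b + hi_b) / 2, "c": tc,
            "d": (lo_d + hi_d) / 2, "e": te}


def derived_constraints(d1, d2, d3, d4):
    """The three constraints obtained at the end of the example."""
    return d2 < d1 and d1 - d2 < d3 and d1 - d2 < d4


print(time_line_for_abcde(10, 4, 7, 8))   # a=0, b=3, c=4, d=8, e=10
print(time_line_for_abcde(10, 4, 5, 8))   # None: d1 - d2 < d3 fails
```

For positive delays, a time line exists exactly when the three derived
constraints hold, which is what the analysis with the numbers $M_{ij}$ predicts.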
Abstract. In computer science, methods to aid learning are very important
because abstract models are used frequently, and conventional teaching
methods do not suffice for them. We have developed learning software
that helps the learner to better understand principles of compiler
construction, in particular lexical analysis. The software offers an
interactive introduction to the problems of lexical analysis, in which
the most important definitions and algorithms are presented in
graphically appealing form. Animations show how finite automata are
created from regular expressions, as well as how finite automata work.
We discuss the principles used throughout the design of the software,
give some preliminary results of evaluations of the software, and discuss
related work.



1 Introduction 

The daily task of a computer science lecturer or teacher is to teach abstract
knowledge and to promote a correct and lasting understanding of this knowledge
by the listeners. For example, assume that a lecturer wants to describe how a
pushdown automaton works. In most cases a large board and a sufficient number
of colored chalks are available. The lecturer now has the challenge of
explaining the automaton on the basis of a small example input, a finite
number of states and a stack picture. After three or four steps he begins to
erase states in the stack picture, to add new states, etc. The listeners will
have to spend more energy reconstructing the complicated sequence of wiping
and writing, and discovering some sense in the disorder, than understanding
how a pushdown automaton works. Thus such a demonstration is difficult for the
learner to reproduce. Visualization and graphical presentation of the pushdown
automaton are a possible solution to this dilemma. Because the process is
dynamic, animations are the first choice for such technical problems. It is
important to present the information in such a way that both the cognitive and
the affective data processing of humans are addressed. The first is sequential
and logical reasoning based on rules and regularities. The second thinks in
pictures, uses analogies, ignores rules, and reacts spontaneously and
creatively. When we look at a suitable picture for an abstract term, we use
both "information channels" and enable the connection of the actual term with
a graphical image. This is also known as "integration learning" (see also P).
Whoever knows how to visualize information well can increase the knowledge and
understanding of learners with this method.

2 Animation of Lexical Analysis 

The learning software "Animation of Lexical Analysis" was developed with
the authoring system Multimedia ToolBook 3.0 from Asymetrix and requires
the free runtime version of ToolBook and Windows 3.x/95/98/NT4. It covers
the description of regular languages by regular expressions, the theory of
finite automata, and the generation of a minimal deterministic finite
automaton from any regular expression as described in |2B|. Currently
there is only a German version of the software.

As an introduction to lexical analysis, several animations show the fundamen- 
tal components of a scanner and the cooperation between parser and scanner. 
Then symbols and symbol classes are explained. It is shown how input symbols,
lexical symbols, symbol classes and their internal representation are connected.

Next, an overview of formal languages and an introduction to regular
languages and regular expressions are given.

Then transition diagrams (TD) and non-deterministic (NFA) and deterministic
(DFA) finite automata are described. There are animated examples for each of
these that can be controlled by the user. The equivalence between regular
expressions and NFAs is explained with a fixed animated example (see Figure 1).
The user can follow the parallel processing of a transition diagram and an NFA
with the same input string. In the figure, we see a snapshot of the NFA in state
z4. The next character to be consumed is the character 5. Now the NFA can
read the character 5 or it can make a transition via ε. The animation shows both
possibilities. Analogously, the current path is highlighted in the TD: the two
edges from node 4 to node 7 and from node 4 back to node 4 are marked red.
The shadowed box in the center of the window briefly describes what the NFA
and TD are currently doing. In the next step, the animation will color the edge
labeled E, update the description box, mark the state z4 as the current state
and dismiss the second transition (z4, ε, z7) of the NFA, because the next
character to be consumed is E.

Three algorithms are explained with controllable animations: the transfor- 
mation of a regular expression to an NFA, the transformation of an NFA to a 
DFA and the transformation of a DFA to a minimal DFA. 

Figure 2 shows the rules of the algorithm Regular Expression to NFA, which
transforms a regular expression into an NFA. An animated example shows
how the algorithm works. It begins with a graph consisting of two nodes and
one edge that is labelled with the regular expression. Step by step, the suitable
rule is applied (alternative, concatenation, Kleene star or parentheses) and the
graph is expanded to the resulting NFA.

Fig. 1. Equivalence of Transition Diagram and NFA

In the algorithm NFA to DFA, the original NFA and the text of the algorithm
are initially shown. With each step the corresponding line of the algorithm is
highlighted and the current nodes and edges in the graph are colored. It can be
seen which nodes of the NFA form a new state in the DFA. The new DFA is
created simultaneously with the processing of the algorithm.

Similarly the algorithm DFA to minimal DFA shows the original DFA and the 
algorithm text. The partition classes are shown in the original graph (through 
coloring) and the minimal DFA is created simultaneously. 
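
The step from an NFA to a DFA that the software animates is commonly done with
the subset construction. The following Python sketch illustrates that standard
technique; it is not the code of the learning software, and the NFA encoding
(a dictionary from (state, symbol) to a set of states, with "" standing for an
ε-move) as well as the example automaton are assumptions made for this example.

```python
def epsilon_closure(states, delta):
    """All NFA states reachable from `states` using only epsilon moves."""
    stack, closure = list(states), set(states)
    while stack:
        q = stack.pop()
        for r in delta.get((q, ""), ()):
            if r not in closure:
                closure.add(r)
                stack.append(r)
    return frozenset(closure)


def nfa_to_dfa(start, accepting, delta, alphabet):
    """Subset construction: each DFA state is a frozenset of NFA states."""
    dfa_start = epsilon_closure({start}, delta)
    dfa_delta, todo, seen = {}, [dfa_start], {dfa_start}
    while todo:
        S = todo.pop()
        for a in alphabet:
            moved = {r for q in S for r in delta.get((q, a), ())}
            if not moved:
                continue  # no transition: the DFA is left partial here
            target = epsilon_closure(moved, delta)
            dfa_delta[S, a] = target
            if target not in seen:
                seen.add(target)
                todo.append(target)
    dfa_accepting = {S for S in seen if S & set(accepting)}
    return dfa_start, dfa_accepting, dfa_delta


def accepts(dfa, word):
    """Run the (partial) DFA on a word."""
    state, accepting, delta = dfa
    for ch in word:
        state = delta.get((state, ch))
        if state is None:
            return False
    return state in accepting


# Hypothetical NFA for the words over {a, b} that end in "ab".
delta = {("s", ""): {0}, (0, "a"): {0, 1}, (0, "b"): {0}, (1, "b"): {2}}
dfa = nfa_to_dfa("s", {2}, delta, "ab")
print(accepts(dfa, "aab"), accepts(dfa, "ba"))  # True False
```

Each DFA state is the set of NFA states that are active at the same time, which
corresponds to what the animation shows when it colors the NFA nodes that form
a new DFA state.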

3 Design Principles 

A prerequisite for implementing good learning software is the application of
good design principles. These principles were developed before the
implementation of our learning software and revised during the implementation
process. Some of these principles result from research on Human-Computer
Interaction (HCI). We propose the following guidelines:

3.1 Text 

— Font size: If the font chosen is too small, the user will have problems
reading the text correctly, in particular sub- or superscripted text. If the
font is too big, the designer of the learning system has to solve the problem
of placing enough information on the page.

— Alignment: Justification is not suitable for small text widths. In this case we
prefer left alignment.

— No serifs if the font is small: Small fonts with serifs are difficult to read
because monitor resolution is not comparable with printer resolution. In
computer science, formulas with superscripted or subscripted letters are used
frequently. These letters are very hard to read in a font with serifs.

Fig. 2. From Regular Expression to NFA

3.2 Colors 

— Use few colors only: Too many colors can irritate the user of the learning
software. But colors help to direct the user's attention. Therefore colors
should be used for things to which the user's attention should be drawn.

— Colors should harmonize: A light background color rules out light font
colors. The contrast between the background and the objects located on it
should be high enough.

— One fixed color for one fixed meaning: Colors for certain links or buttons
should not be changed or merged, e.g. the color blue is used for "hotword"
links in our software.

3.3 Screen Arrangement

— Main activity in one window only: The attention of the user should be
directed to one goal only. Too many windows on the screen can promote
disorientation.

— One lesson on one page if possible: In order to avoid disorientation, each
basic lesson is arranged on a single page. Additional information is reached
by using "hotwords" or buttons.

— Consistent design: Certain window areas should always be in the same place,
e.g. the control buttons of the animations are consistently located in a special
bar below the main window. All links must have one fixed color in the case of
"hotwords" or one fixed symbol in the case of links to animations, definitions,
etc.

— No overloading of windows: If a window contains a lot of text and many
animations, the user has difficulty identifying the important information.



3.4 Definitions 

— Accumulate definitions: All definitions relevant to lexical analysis are
collected in an independent window. A first advantage is the space saved in
the main window, which is important if a definition is very long. Furthermore,
the user has an overview of all definitions and can look them up.
They can be sorted in alphabetical order or in the order of their occurrence
in the explanatory text.



3.5 Orientation and Navigation 

— Easy navigation and good orientation: The current chapter and section of the
lessons are shown in a status bar located below the main window. The user
can navigate to the index, and from there to another page, by clicking
an Index button in the status bar.



3.6 Animations 

— Flexible control: Animations should be adjustable in speed. It should be
possible to execute them step by step, and to stop and reset them at any point
in time. A reversed execution of animations is not meaningful in all cases
and is frequently technically difficult to realize. As an alternative, an
Undo operation that allows for a finite number of backtrack steps is
appropriate. Control buttons should have a well-known look, perhaps
like the control buttons of a cassette recorder.

— Clearly defined object movements: Objects should move as directly as
possible to their target, but should not pass over too many other
objects. Several objects that are not logically coherent should not be moved
at the same time, and the movement should not be too complex or jerky.

— Direct feedback of user actions: Particularly in the context of animations,
visual feedback from the software is important when users interact with the
system. If an animation is stopped, this stop should take place immediately
and the animation should not continue for an undefined amount of time.

— Minimal memorization requirements for users: Animations and the
corresponding assertions should run within a spatial and logical framework.
For animations that are generated automatically at runtime, this principle
often conflicts with the high demand for space on the screen.

— Spatial requirements of an animation: More complex animations should take
place on a page of their own. Smaller animations should be arranged near their
textual explanation.

The most challenging task when developing the learning software was to try to
satisfy as many as possible of the above, partially conflicting, requirements.



4 Evaluation 

Our target groups are students who take a computer science course at high
school, as well as university students of computer science. In a pre-evaluation
we left the students alone with the system. They were to move independently
through the graphical environment of the system and discover the learning
content on their own. We had good experiences with this approach, whereby we
presupposed that the students were already familiar with the operating system
Microsoft Windows. The students got along well with the learning system, since
they encountered a well-known graphical user interface. Students with previous
knowledge of compiler design accepted the system better than students who did
not yet know anything about the construction of compilers. The students
moved playfully through the visualizations and animations and were also able
to connect these correctly with the theoretical background (definitions,
algorithms, ...). Surprisingly, this method worked so well that the students
pointed out inconsistencies and typing errors in definitions to us. Students
liked the visual organization of the user interface and animations.

Further presentations of our system at an advanced training course for teachers
(International Conference and Research Center for Computer Science, Dagstuhl
Castle) and at the booth of the University of Saarland at the computer fairs
CeBIT98 and CeBIT99 in Hannover (Germany) provided positive feedback. However,
these measures are not yet sufficient for a serious evaluation. For this reason
we are currently cooperating with cognitive psychologists to develop an
experiment for schools and universities, in order to obtain statistically
significant data about our software. In this experiment certain aspects and
characteristics of our work, e.g. the page layout or the animation control,
are examined separately, while all other variable system properties remain
unchanged. The use of visualization and animation is compared with conventional
instruction by a teacher. We are still in the preparation phase, so there are
no results yet.

We carried out the pre-evaluation with 8 high school students (16-18 years old,
Oberstufe Gymnasium) from a computer science course and obtained some
preliminary results: 6 of them would use the system at home, 3 of them had
problems with the control of the animation, none had problems with "hotwords"
and only one partially disliked the design of the pages and examples.

5 Related Work 

In recent years, other visualizations in the context of compiler design have
also been developed at the University of Saarland, including visualizations of
abstract machines for imperative, logical and functional programming languages
(P2i, m and 1221). These visualizations were implemented under X-Windows
(UNIX). They show the effects of the execution of machine instructions on the
run-time stack and heap; however, they contain few animations. Furthermore,
a tool called VCG ("Visualization of Compiler Graphs") was developed for the
visualization of graphs from the area of compiler design. The VCG tool exists
for several computer systems, including the Microsoft Windows system. See
|T^ and m.

Another learning system developed at the University of Saarland is the
"Animation of Semantical Analysis" ma, m. This application illustrates and ani-
mates the basic tasks of semantical analysis by textual and graphical examples. 
It covers basic knowledge, like the concepts of scoping and visibility, checking 
of context conditions (identification of identifiers, checking of type consistency), 
overloading of identifiers and polymorphism. The corresponding algorithms for 
analysis can be examined with many examples. The user can even enter his 
own example programs and specifications. From these inputs animations and 
visualizations are generated. 

Although there exist a huge number of algorithm animations today, there is only
a small amount of fundamental work in the field. Marc H. Brown developed several
algorithm animation systems, like BALSA, ZEUS, CAT, etc. These systems are
frameworks in which algorithms can be animated by annotations ("interesting
events") and by definitions of graphical views ([3, 0, 0, 0 0). John T.
Stasko conceived the path transition paradigm and implemented it in the systems
TANGO, XTANGO, SAMBA, etc., see ^0]. These systems also use the concept of
"interesting events". All newer versions of the above systems are complete
environments, which offer some editors for the creation of views in which the
algorithms are animated. The Web-based animation system CAT (or the newer
JCAT, which is implemented in the programming language Java) is a complete
development environment for the creation of algorithm animations in the WWW.
This system offers more possibilities than ad hoc programmed Java applets. It
is possible to create algorithm animations that a teacher can demonstrate to
his students online. The interaction of the students is thus limited, but the
system represents a step towards the so-called "electronic classroom". An
animation can be configured in such a way that the students have the
possibility to intervene interactively. They can control the animation and
select other views of the algorithm. The paper |0I gives a good outline of most
of the systems for algorithm animation mentioned above.

6 Conclusion

We have developed the learning software "Animation of Lexical Analysis", which
helps the learner to better understand principles of compiler construction, in
particular lexical analysis. The software offers an interactive introduction
to the problems of lexical analysis, in which the most important definitions
and algorithms are presented in graphically appealing form. Animations show
how finite automata are created from regular expressions, as well as how
finite automata work.

In our current evaluation we would like to find out whether the presentation
of the learning content through the learning software has pedagogical
advantages and where the software shows weaknesses. Questions to be answered
are, for example, whether animations can be controlled intuitively and where
the animation controls should be placed. From a technical point of view, the
use of the authoring system MTB 3.0 is questionable. It has severe restrictions
and the runtime system takes up much storage space. For these reasons, when
using authoring systems, important sections of the software must usually be
implemented in another programming language, like C. The advantage of the
system is its simplicity of operation and programming.

A new generative approach to learning software is pursued in our current
project GANIMAL, which is funded by the "Deutsche Forschungsgemeinschaft -
DFG". The goal of the project is to create explorative learning software for
compiler design, in which for each compiler phase the implementation and the
appropriate visualization or animation are generated automatically from
specifications. To achieve platform independence we use the programming
language Java. Experience with designing the learning software presented here,
as well as its evaluations, will serve as a basis for the GANIMAL project (see
also 0, |B1).

The experience gained is not only applicable to the technical area of computer
science, but can also be transferred to other areas in which processes are to
be visualized, for instance medicine, electrical engineering, etc. The reader
can find further information about the current level of development, as well
as the newest versions of the software, in the WWW [24].
Abstract. We study automata-theoretic properties of distances and
quasi-distances between words. We show that every additive distance 
is finite. We also show that every additive quasi-distance is regularity- 
preserving, that is, the neighborhood of any radius of a regular language 
with respect to an additive quasi-distance is regular. As an application 
we present a simple algorithm that constructs a metric (fault-tolerant) 
lexical analyzer for any given lexical analyzer and desired radius (fault- 
tolerance index). 



1 Introduction 



You are frustrated when you type a UNIX command incorrectly and cannot find 
what the correct spelling is. You may be wondering why the system does not 
give any suggestions on what command you might want to type. Those questions 
concern the concepts of distances between words and neighborhoods of languages 
with respect to a distance and a radius. 

Much work has been done in spell checking and correction, and other online 
dictionary applications using various methods |.bl7liSI9j . Here, we study some 
automata-theoretic properties of different measurements of distances between 
words. 

Let $\Sigma$ be a finite alphabet. By the neighborhood of a word $w \in \Sigma^*$ of
radius $\alpha$ with respect to a distance measure $\delta$, we mean the set of all
words $u$ whose distance $\delta(u,w)$ from $w$ is at most $\alpha$. We denote this
neighborhood by $E(\{w\},\delta,\alpha)$. Naturally, the neighborhood of a language $L$
of radius $\alpha$ with respect to $\delta$, denoted $E(L,\delta,\alpha)$, is the union of
$E(\{w\},\delta,\alpha)$ for all words $w \in L$. A distance $\delta$ is said to be finite if
$E(\{w\},\delta,\alpha)$ is finite for all $w \in \Sigma^*$ and $\alpha \ge 0$. Informally,
$\delta$ is said to be additive if its measurement distributes over concatenation, and
regularity-preserving if $E(R,\delta,\alpha)$ is regular for every regular language $R$
and radius $\alpha \ge 0$.

In this paper, we prove that every additive distance is finite. We also show, 
as our main result, that every additive distance (or quasi-distance) is regularity- 
preserving. Examples of various additive and non-additive distance measures are 
also given in the paper. 

As an application of the main result, we construct a very simple algorithm 
that transforms a given lexical analyzer to a metric (fault-tolerant) lexical ana- 
lyzer for an arbitrary radius (fault-tolerance index). 

The paper is organized as follows: In the next section we introduce the basic 
notation. In Section 3, we define distances and quasi-distances. Our main results 
concerning finite, additive, and regularity-preserving distance measures are pre- 
sented in Section 4. In the last section we define metric lexical analyzers and 
describe a simple algorithm that constructs a metric lexical analyzer for a given 
lexical analyzer and desired radius. 

2 Preliminaries 

We assume that the reader is familiar with the basics of formal languages and 
finite automata in particular, cf. mm- Here we introduce the notation we 
will use in the later sections. 

The symbol $\Sigma$ denotes a finite alphabet and $\Sigma^*$ the set of finite words
over $\Sigma$. The empty word is denoted by $\lambda$ and the length of a word
$w \in \Sigma^*$ by $|w|$. The shuffle of words $u, v \in \Sigma^*$,

$$\omega(u,v) \subseteq \Sigma^*,$$

is the set of all words $x_1 y_1 x_2 y_2 \cdots x_m y_m$ such that $u = x_1 \cdots x_m$,
$v = y_1 \cdots y_m$, $x_i, y_i \in \Sigma^*$, $i = 1, \ldots, m$, $m > 0$. The catenation of
languages $S, T \subseteq \Sigma^*$ is denoted by $ST$.

A deterministic finite automaton (DFA) is a five-tuple $A = (Q, \Sigma, \gamma, s, F)$
where $Q$ is the finite set of states, $\Sigma$ is the finite alphabet, $s \in Q$ is the
initial state, $F \subseteq Q$ is the set of final states, and $\gamma : Q \times \Sigma \to Q$
is the state-transition function. If $A$ is defined as above except that $\gamma$ is a
function $Q \times \Sigma \to \mathcal{P}(Q)$ then we say that $A$ is a nondeterministic
finite automaton (NFA). (Here $\mathcal{P}(Q)$ is the set of subsets of $Q$.)

The state-transition relation $\gamma$ of an NFA is extended in the natural way to
a function $\hat{\gamma} : Q \times \Sigma^* \to \mathcal{P}(Q)$. We denote also
$\hat{\gamma}$ simply by $\gamma$, and the language accepted by $A$ is
$L(A) = \{w \in \Sigma^* \mid \gamma(s,w) \cap F \ne \emptyset\}$.

3 Distances and Quasi-distances

We want to measure the distance between distinct words of $\Sigma^*$. Let $S$ be a set.
We say that a function $\delta : S \times S \to [0, \infty)$ is a distance if it satisfies
the following three conditions:

(D1) $\delta(x,y) = 0$ iff $x = y$, for all $x, y \in S$,

(D2) $\delta(x,y) = \delta(y,x)$, for all $x, y \in S$,

(D3) $\delta(x,z) \le \delta(x,y) + \delta(y,z)$, for all $x, y, z \in S$.

Condition (D3) is called the triangle-inequality. A function
$\delta : S \times S \to [0, \infty)$ that satisfies (D2) and (D3) and the weaker condition

(D1') $\delta(x,x) = 0$, for all $x \in S$,

is called a quasi-distance on $S$. A quasi-distance allows the possibility that
$\delta(x,y) = 0$ for $x \ne y$.

Note that if $\delta$ is a quasi-distance on $S$ we can define an equivalence relation
$\sim_\delta$ on $S$ by setting $x \sim_\delta y$ iff $\delta(x,y) = 0$. Then the mapping
$\delta'$ defined by $\delta'([x]_{\sim_\delta}, [y]_{\sim_\delta}) = \delta(x,y)$ is a distance
on $S/\!\sim_\delta$. (Since $\delta$ satisfies the condition (D3) it follows that the value
of $\delta'([x]_{\sim_\delta}, [y]_{\sim_\delta})$ does not depend on the representatives
$x$ and $y$.)

Let $\delta$ be a (quasi-)distance on $S$, $K \subseteq S$ and $\alpha \ge 0$. The
neighborhood of $K$ of radius $\alpha$ (with respect to $\delta$) is

$$E(K,\delta,\alpha) = \{x \in S \mid (\exists y \in K)\ \delta(x,y) \le \alpha\}.$$



A natural distance between words of the same length is the so called Hamming 
distance. Since we need to compare also words of different lengths, there is more 
than one natural way to extend Hamming distance. 

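Let $\#$ be a symbol not appearing in $\Sigma$ and put $\Gamma = \Sigma \cup \{\#\}$.
For $a, b \in \Gamma$ define

$$\Delta(a,b) = \begin{cases} 1 & \text{if } a \ne b, \\ 0 & \text{if } a = b. \end{cases}$$

Define $\Delta_n : \Gamma^n \times \Gamma^n \to \mathbb{N}$ by setting

$$\Delta_n(x_1 \cdots x_n, y_1 \cdots y_n) = \sum_{i=1}^{n} \Delta(x_i, y_i).$$

The prefix-Hamming distance $\delta_{pH}$ on $\Sigma^*$ is defined as follows. Let
$u, v \in \Sigma^*$. Then

$$\delta_{pH}(u,v) = \begin{cases} \Delta_{|v|}(u\#^k, v) & \text{if } k = |v| - |u| \ge 0, \\ \Delta_{|u|}(u, v\#^k) & \text{if } k = |u| - |v| > 0. \end{cases}$$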



The prefix-Hamming distance counts the positions at which the words $u$ and $v$
differ among their first $\min\{|u|, |v|\}$ positions and adds to the result the length
of the remaining suffix. It is easy to verify that $\delta_{pH}$ satisfies the
triangle-inequality and, thus, it is a distance. On the other hand, this distance is not
very useful from a practical point of view because inserting or deleting one letter can
change the distance of given words by an arbitrary amount (depending on the length of
the words).
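The sensitivity to a single insertion or deletion can be seen in a minimal Python sketch. The function name `prefix_hamming` is an illustrative choice, not notation from the paper; it implements the verbal description above (mismatches in the common-length prefix plus the length of the leftover suffix).

```python
def prefix_hamming(u: str, v: str) -> int:
    """Mismatches within the first min(|u|, |v|) positions plus the leftover length."""
    common = min(len(u), len(v))
    mismatches = sum(1 for a, b in zip(u[:common], v[:common]) if a != b)
    return mismatches + abs(len(u) - len(v))


# Deleting the first letter of an alternating word shifts every position,
# so the distance grows with the length of the word.
for n in (2, 4, 8):
    u = "ab" * n
    print(len(u), prefix_hamming(u, u[1:]))
```

For the alternating word of length 8, deleting its first letter already yields a prefix-Hamming distance of 8, although a single edit step separates the two words.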

A better extension is the function which considers all possible ways to pad 
both words and then takes the minimum of the obtained distances. Let u,v G E*. 
Then we define 
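$$\delta_H(u,v) = \min\{\Delta_k(x,y) \mid k \ge \max\{|u|,|v|\},\ x \in \omega(u, \#^{k-|u|}),\ y \in \omega(v, \#^{k-|v|})\}. \qquad (1)$$

Notice that for all $u, v \in \Sigma^*$, $\delta_H(u,v) \le \max\{|u|,|v|\}$, and

$$\delta_H(u,v) = \min\{\Delta_{|u|+|v|}(x,y) \mid x \in \omega(u, \#^{|v|}),\ y \in \omega(v, \#^{|u|})\}.$$

In general, $\delta_H(u,v) \ne \Delta_{\max\{|u|,|v|\}}(x,y)$ for every
$x \in \omega(u, \#^{\max\{|u|,|v|\}-|u|})$, $y \in \omega(v, \#^{\max\{|u|,|v|\}-|v|})$.
For example, take $u = abab$, $v = baba$ and observe that
$\omega(u, \#^0) = \{u\}$, $\omega(v, \#^0) = \{v\}$, and
$\Delta_4(u,v) = 4 > \delta_H(u,v) = \Delta_5(u\#, \#v) = 2$.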
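The example can be checked by brute force. The following Python sketch assumes that the minimum may be taken over paddings of total length $|u| + |v|$, as in the displayed identity; the names `shuffle`, `hamming`, and `shuffle_hamming` are illustrative and do not come from the paper. It enumerates all paddings, so it is only suitable for short words.

```python
from functools import lru_cache

PAD = "#"  # padding symbol, assumed not to occur in the input words


@lru_cache(maxsize=None)
def shuffle(u: str, v: str) -> frozenset:
    """All interleavings of u and v that preserve the letter order of each word."""
    if not u or not v:
        return frozenset({u + v})
    return frozenset({u[0] + w for w in shuffle(u[1:], v)} |
                     {v[0] + w for w in shuffle(u, v[1:])})


def hamming(x: str, y: str) -> int:
    """Number of positions at which two words of equal length differ."""
    assert len(x) == len(y)
    return sum(a != b for a, b in zip(x, y))


def shuffle_hamming(u: str, v: str) -> int:
    """Minimum Hamming distance over all paddings of u and v to length |u| + |v|."""
    xs = shuffle(u, PAD * len(v))
    ys = shuffle(v, PAD * len(u))
    return min(hamming(x, y) for x in xs for y in ys)


print(hamming("abab", "baba"))          # no padding: 4
print(hamming("abab#", "#baba"))        # one padding symbol each: 2
print(shuffle_hamming("abab", "baba"))  # minimum over all paddings: 2
```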

It is convenient to look at $\delta_H$ as a process. Consider changing a word into
another word by means of the following three types of edit steps (|S|): a) insert:
insert a character into a word, b) delete: delete a character from a word, c)
replace: replace one character with a different character. Edit steps can be applied
in any order. For example, to change the word $abab$ into $baba$ we can use rule
c) (replace) four times and we get $bbab$, $baab$, $babb$, $baba$. We can be more
efficient by deleting the first character of $abab$ to get $bab$ and then inserting $a$
at the end, so with only two edit steps we obtain $baba$. As we have seen above,
$\delta_H(abab, baba) = 2$; it can be obtained by first constructing the extended
words $abab\#$ and $\#baba$ and then computing their distance. In fact, we have:

Lemma 1. For all words $u, v$, $\delta_H(u,v)$ coincides with the minimal number of
edit steps necessary to change $u$ into $v$.$^1$
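The minimal number of edit steps in Lemma 1 can be computed with the standard dynamic-programming recurrence for edit distance. The sketch below uses this well-known technique; it is not an algorithm given in the paper, and the function name `edit_distance` is illustrative.

```python
def edit_distance(u: str, v: str) -> int:
    """Minimal number of insert, delete and replace steps that change u into v."""
    prev = list(range(len(v) + 1))          # distances from the empty prefix of u
    for i, a in enumerate(u, start=1):
        cur = [i]                           # deleting the first i letters of u
        for j, b in enumerate(v, start=1):
            cur.append(min(prev[j] + 1,               # delete a
                           cur[j - 1] + 1,            # insert b
                           prev[j - 1] + (a != b)))   # replace a by b, or keep a
        prev = cur
    return prev[-1]


print(edit_distance("abab", "baba"))  # the example from the text
```

The table is filled row by row, so only two rows are kept in memory at any time.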

Corollary 1. The function $\delta_H$ satisfies (D1)-(D3).

The function $\delta_H$ is a distance by Corollary 1; as it extends Hamming's
distance it is appropriate to call it the shuffle-Hamming distance.

An immediate property of the shuffle-Hamming distance follows: insertions
and deletions of the special symbol $\#$ do not count.

Lemma 2. For all $u, v \in \Sigma^*$ and $i \ge 0$, $\delta_H(u,v) = \delta_H(u',v)$, for all
$u' \in \omega(u, \#^i)$.

Other possible distances can be obtained by varying edit steps (e.g., allowing 
adjacent characters in one word to be interchanged while copied to the other 
word) or by assigning cost functions to edit steps (e.g., capturing the idea that 
the cost of replacing a character is less than the combined costs of deletion and 
insertion). See [2] for more examples of discrete distances.

4 Neighborhoods of Regular Languages 

Let $L$ be a regular language over $\Sigma$. We are interested in the following question:
Which conditions should the distance $\delta$ satisfy in order to guarantee that all the
languages $E(L,\delta,\alpha)$, $\alpha \ge 0$, are regular? We say that a distance $\delta$ is
regularity-preserving if $E(L,\delta,\alpha)$ is a regular language for all regular
languages $L$ and $\alpha \ge 0$.

$^1$ This number is called the edit-distance in P|, pp. 325-326; it has been suggested by
Ulam [12].

It is fairly straightforward to construct examples of distances on E* that are 
not regularity-preserving. Here is such an example. 

Example 1. Let $\Sigma = \{a, b\}$. Construct the distance $\delta$ by

$$\delta(u,v) = \begin{cases} 0 & \text{if } u = v, \\ 1/2 & \text{if } u = a^n b^n,\ v = a^m b^m \text{ for some } n, m \ge 0,\ n \ne m, \\ 1 & \text{otherwise,} \end{cases}$$

and notice that $E(\{ab\}, \delta, 1/2) = \{a^n b^n \mid n \ge 0\}$. □

Clearly we need to impose some additional conditions on the distance $\delta$. Note
that the distance in Example 1 has the property that for $n \ge 0$ and $\alpha \ge 1/2$,
the inequality $\delta(u, a^n b^n) \le \alpha$ has infinitely many solutions. Hence, the
following finiteness requirement seems to be a suitable candidate to guarantee that a
distance is regularity-preserving.

We say that a (quasi-)distance $\delta$ on $\Sigma^*$ is finite if for all $w \in \Sigma^*$ and
$\alpha \ge 0$, the set $E(\{w\},\delta,\alpha)$ is finite.

Both the shuffle-Hamming distance and the prefix-Hamming distance con- 
sidered above are clearly finite. The following example shows that finiteness of 
a distance S is, unfortunately, not sufficient to guarantee that S is regularity- 
preserving. 

Example 2. Let $\Sigma = \{a, b, c\}$. By slightly modifying the prefix-Hamming
distance $\delta_{pH}$ we construct a finite distance $\delta$ on $\Sigma^*$ that is not
regularity-preserving. For $u, v \in \Sigma^*$ we define

$$\delta(u,v) = \begin{cases} 3/2 & \text{if } u = a^n b a^n,\ v = a^n c a^n,\ n \ge 0, \text{ or vice versa,} \\ \delta_{pH}(u,v) & \text{otherwise.} \end{cases}$$

Clearly $\delta$ satisfies the conditions (D1) and (D2), so in order to show that it is
a distance it is sufficient to verify the triangle-inequality. Assuming that (D3)
does not hold, we must have $x, y, z \in \Sigma^*$ such that

$$\delta(x,z) > \delta(x,y) + \delta(y,z). \qquad (2)$$

Since for all $u, v \in \Sigma^*$, $\delta(u,v) \ge \delta_{pH}(u,v)$ and $\delta_{pH}$ is a
distance, it follows that if (2) holds, then necessarily $\delta(x,z) \ne \delta_{pH}(x,z)$,
that is, $x = a^n b a^n$, $z = a^n c a^n$, $n \ge 0$, or vice versa. Thus
$\delta(x,z) = 3/2$, and (2) implies that $\delta(x,y) = 0$ or $\delta(y,z) = 0$. Both
possibilities directly yield a contradiction.

Also, $\delta$ is finite since for any $\alpha \ge 2$ and $w \in \Sigma^*$ we have
$E(\{w\},\delta,\alpha) = E(\{w\},\delta_{pH},\alpha)$.

To see that $\delta$ is not regularity-preserving choose $L = a^* b a^*$. Then

$$E(L,\delta,3/2) - E(L,\delta,1) = \{a^n c a^n \mid n \ge 0\},$$

which implies that at least one of the languages $E(L,\delta,3/2)$ and $E(L,\delta,1)$ is
not regular. □
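The set difference in Example 2 can be checked on short words by brute force. The sketch below is only a bounded sanity check, not a proof: it assumes the prefix-Hamming distance as described earlier, restricts attention to words of length at most `MAX`, and uses illustrative helper names (`prefix_hamming`, `is_special_pair`, `delta`, `dist_to_L`). Words of $L$ up to length `MAX + 1` suffice because any word within distance 3/2 differs in length by at most one.

```python
from itertools import product


def prefix_hamming(u: str, v: str) -> int:
    common = min(len(u), len(v))
    return sum(a != b for a, b in zip(u[:common], v[:common])) + abs(len(u) - len(v))


def is_special_pair(u: str, v: str) -> bool:
    """True if {u, v} = {a^n b a^n, a^n c a^n} for some n >= 0."""
    n, odd = divmod(len(u), 2)
    if len(u) != len(v) or not odd:
        return False
    return {u, v} == {"a" * n + "b" + "a" * n, "a" * n + "c" + "a" * n}


def delta(u: str, v: str) -> float:
    """The modified distance of Example 2."""
    return 1.5 if is_special_pair(u, v) else prefix_hamming(u, v)


MAX = 5
# All words of L = a*ba* of length at most MAX + 1.
L = ["a" * m + "b" + "a" * k
     for m in range(MAX + 1) for k in range(MAX + 1) if m + k <= MAX]


def dist_to_L(w: str) -> float:
    return min(delta(w, x) for x in L)


all_words = ("".join(t) for n in range(MAX + 1) for t in product("abc", repeat=n))
difference = [w for w in all_words if 1 < dist_to_L(w) <= 1.5]
print(difference)
```

Up to length 5 the difference consists of exactly the words `c`, `aca`, and `aacaa`, in agreement with the set $\{a^n c a^n \mid n \ge 0\}$.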

The above example shows that we need to look for stronger restrictions for
regularity-preserving distances. Since elements of $\Sigma^*$ have a unique
decomposition into subwords (of given length) it is perhaps reasonable to assume that
the distances should "respect" such decompositions. Thus we say that a (quasi-)distance
$\delta$ on $\Sigma^*$ is additive if always when $w = w_1 w_2$ ($w_1, w_2 \in \Sigma^*$)
we have for all $\alpha \ge 0$,

$$E(\{w\},\delta,\alpha) = \bigcup_{\beta_1 + \beta_2 = \alpha} E(\{w_1\},\delta,\beta_1)\, E(\{w_2\},\delta,\beta_2). \qquad (3)$$

First we observe that an additive distance is always finite. Note that an
additive quasi-distance $\delta$ need not be finite. If, for some $b \in \Sigma$,
$\delta(b,\lambda) = 0$, then any $\delta$-neighborhood is necessarily infinite.

Lemma 3. Every additive distance is finite.

Proof. Let $\delta$ be an additive distance on $\Sigma^*$. By (3), for any
$w = b_1 \cdots b_k$, $b_i \in \Sigma$, $i = 1, \ldots, k$, $E(\{w\},\delta,\alpha)$ is contained in
the catenation of the languages $E(\{b_1\},\delta,\alpha), \ldots, E(\{b_k\},\delta,\alpha)$.
Thus, it is sufficient to show that $E(\{b\},\delta,\alpha)$ is finite for $b \in \Sigma$ and
$\alpha \ge 0$.

Let $u = c_1 \cdots c_m$, $c_i \in \Sigma$, be an arbitrary word of $\Sigma^*$. The additivity
condition implies that $u \in E(\{b\},\delta,\alpha)$ iff there exists $i \in \{1, \ldots, m\}$
such that

$$\delta(b, c_i) + \sum_{j \in \{1,\ldots,m\},\ j \ne i} \delta(\lambda, c_j) \le \alpha. \qquad (4)$$

There exist only a finite number of words $u = c_1 \cdots c_m$ that satisfy the above
inequality. □

Both the prefix-Hamming distance and the shuffle-Hamming distance are
additive.

Proposition 1. The distances $\delta_{pH}$ and $\delta_H$ defined on an alphabet $\Sigma$ are
additive.

Proof. We show that $\delta_H$ is additive as the proof for the distance $\delta_{pH}$ is
simpler.

Let $w = w_1 w_2$ be an arbitrary decomposition of a word $w \in \Sigma^*$. We show
that for every $u \in \Sigma^*$,

$$u \in E(\{w_1 w_2\},\delta_H,\alpha) \text{ iff } u \in \bigcup_{\beta_1 + \beta_2 = \alpha} E(\{w_1\},\delta_H,\beta_1)\, E(\{w_2\},\delta_H,\beta_2).$$

Assume $\delta_H(u, w_1 w_2) \le \alpha$. As edit steps (in the process of changing a word
into another word) can be applied in any order, we can start the process of
changing $u$ into $w_1 w_2$ in such a way as to obtain first $w_1$ from a prefix $u_1$ of
$u$, and then $w_2$ (from the remaining suffix $u_2$ of $u$). Consequently,
$\delta_H(u_1, w_1) + \delta_H(u_2, w_2) = \delta_H(u, w_1 w_2) \le \alpha$. Conversely, if
$u_i \in E(\{w_i\},\delta_H,\beta_i)$, $i = 1, 2$, $\beta_1 + \beta_2 \le \alpha$, then we have
$\delta_H(u_1 u_2, w_1 w_2) \le \delta_H(u_1, w_1) + \delta_H(u_2, w_2) \le \alpha$. □
From Example 2 we know that a finite distance need not preserve regularity.
Below we show that, on the other hand, additivity is a sufficient condition to
guarantee that even a quasi-distance preserves regularity. Note that, as observed
above, an additive quasi-distance need not be finite. First we prove the following
lemma.

Lemma 4. Assume that $\delta$ is an additive quasi-distance on $\Sigma^*$.

(i) For each $b \in \Sigma$ and $\alpha \ge 0$, $E(b,\delta,\alpha)$ is regular.

(ii) Let $b \in \Sigma$ and $\alpha \ge 0$ be fixed. There exists an integer $k$ and numbers
$0 = \alpha_1 < \ldots < \alpha_k = \alpha$ such that

$$E(b,\delta,\alpha_i),\quad i = 1, \ldots, k,$$

are all the distinct neighborhoods of $b$ having radius at most $\alpha$.

Proof. (i) Let $u = c_1 \cdots c_m$, $m \ge 0$, $c_i \in \Sigma$, $i = 1, \ldots, m$. As in the proof
of Lemma 3 it follows that $u \in E(b,\delta,\alpha)$ iff the inequality (4) holds. (Note that,
in contrast to Lemma 3, $\delta$ is now only a quasi-distance, so this does not imply the
finiteness of the neighborhood.)

Denote

$$\Theta = \{d \in \Sigma \mid \delta(d,\lambda) = 0\}.$$

Let $\Psi$ be the set of finite multisets of elements of $\Sigma$,

$$\{c_i, c_{j_1}, \ldots, c_{j_r}\}$$

such that $\delta(\lambda, c_{j_l}) \ne 0$, $l = 1, \ldots, r$, and

$$\delta(b, c_i) + \sum_{l=1}^{r} \delta(\lambda, c_{j_l}) \le \alpha.$$

Then $u = c_1 \cdots c_m$ satisfies the inequality (4) iff $u$ is the shuffle of a sequence
obtained by listing the elements of a multiset belonging to $\Psi$ (in arbitrary order)
and a word in $\Theta^*$. The shuffle of a finite language and a regular language is always
regular.

(ii) In the construction above the elements of the multisets belonging to $\Psi$
completely determine the neighborhoods of radius at most $\alpha$ around $b$. Thus as
the radii $\alpha_s$, $s = 1, \ldots, k$, we can simply take all the (distinct) sums
$\delta(b, c_i) + \sum_{l=1}^{r} \delta(\lambda, c_{j_l})$ where the multiset
$\{c_i, c_{j_1}, \ldots, c_{j_r}\}$ belongs to $\Psi$. (Note that $\Psi$ is a finite collection
of multisets.) □

The above construction implies that Lemma 4 (ii) can be written in the
following stronger form:

Corollary 2. Assume that $\delta$ is an additive quasi-distance on $\Sigma^*$ and let
$b \in \Sigma$ and $\alpha \ge 0$ be fixed. Then we can write

$$E(b,\delta,\alpha) = R_1 \cup \ldots \cup R_k$$

where $R_i = \{w \in \Sigma^* \mid \delta(b,w) = \alpha_i\}$, $i = 1, \ldots, k$, is regular.

Proof. Without loss of generality we can assume that the numbers $\alpha_i$ in Lemma 4
(ii) are chosen so that there exists $w_i \in \Sigma^*$ with $\delta(b, w_i) = \alpha_i$,
$i = 1, \ldots, k$. Let $R_i$, $i = 1, \ldots, k$, be as above. By Lemma 4 (ii),
$R_i = E(b,\delta,\alpha_i) - E(b,\delta,\alpha_{i-1})$, $i = 2, \ldots, k$, and
$R_1 = E(b,\delta,0)$. By Lemma 4 (i), these sets are regular. □

Now we are ready to prove the main result of this section. 

Theorem 1. Assume that $\delta$ is an additive quasi-distance on $\Sigma^*$ and let
$L \subseteq \Sigma^*$ be regular. Then $E(L,\delta,\alpha)$ is regular for all $\alpha \ge 0$.

Proof. Let $\alpha \ge 0$ be fixed and let $A = (Q, \Sigma, \gamma, s, F)$ be a DFA such that
$L = L(A)$. Without loss of generality we can assume that the initial state $s$ is
not reachable from any other state.

By Corollary 2, for each $b \in \Sigma$ we can write

$$E(b,\delta,\alpha) = R^b_1 \cup \ldots \cup R^b_{k(b)},$$

where

$$R^b_j = \{w \in \Sigma^* \mid \delta(w,b) = \alpha^b_j\},\quad 0 \le \alpha^b_j \le \alpha,$$

is regular, $j = 1, \ldots, k(b)$. Denote
$D' = \{\alpha^b_j \mid b \in \Sigma,\ 1 \le j \le k(b)\}$ and
$D = \{\beta \le \alpha \mid \beta = \beta_1 + \cdots + \beta_r,\ \beta_i \in D',\ 1 \le i \le r\}$.

We construct an NFA $B = (Q_B, \Sigma, \gamma_B, s_B, F_B)$ such that

$$L(B) = E(L(A),\delta,\alpha).$$

Choose $Q_B = Q \times D$, $s_B = (s, 0)$ and

$$F_B = \begin{cases} F \times D \cup \{s_B\} & \text{if } \lambda \in E(L(A),\delta,\alpha), \\ F \times D & \text{otherwise.} \end{cases}$$

The transition relation $\gamma_B$ is defined as follows. Let $q \in Q$, $\beta \in D$ and
$b \in \Sigma$. Then

$$(q', \beta + \alpha^b_j) \in \gamma_B((q,\beta), b) \qquad (5)$$

for every $q' \in \gamma(q, R^b_j)$, $1 \le j \le k(b)$, such that $\beta + \alpha^b_j \le \alpha$.
(Here $\gamma(q, R^b_j) = \{\gamma(q,v) \mid v \in R^b_j\}$.) Since $R^b_j$ is regular, the set
$\gamma(q, R^b_j)$ ($\subseteq Q$) can even be effectively determined.

Let $w = b_1 \cdots b_m$, $m \ge 1$, $b_i \in \Sigma$, $i = 1, \ldots, m$. Since $\delta$ is additive,

$$w \in E(L(A),\delta,\alpha) \text{ iff } (\exists u \in L(A)) \text{ such that } u \in \bigcup_{\beta_1 + \ldots + \beta_m = \alpha} E(b_1,\delta,\beta_1) \cdots E(b_m,\delta,\beta_m). \qquad (6)$$

In the transitions (5), on input $b$ the first component of the states of $B$ simulates
the computation of $A$ on an arbitrary (nondeterministically chosen) word
$v \in R^b_j$, and in the second component we correspondingly increment the distance by
$\alpha^b_j = \delta(b,v)$. By observation (6), some sequence of the nondeterministic choices
on input $w = b_1 \cdots b_m$ leads to an accepting state of $F_B$ iff $w$ is in
$E(L(A),\delta,\alpha)$. By the choice of the set $F_B$, the NFA $B$ accepts $\lambda$ if and
only if $\lambda \in E(L(A),\delta,\alpha)$. □
5 A Metric Lexical Analyzer

The major difference between a lexical analyzer and a (traditionally-defined) 
finite automaton is that, in a lexical analyzer, each final state is linked to an 
action (or a set of actions). Because of this difference, the algorithms that are 
designed for finite automata may not directly apply to lexical analyzers. The 
equivalence of two final states in a deterministic lexical analyzer requires that 
not only the states are equivalent in the sense of a DFA, but also that they have 
the same action (or actions). 

There are many other features which are associated with certain types of 
lexical analyzers. For example, some lexical analyzers assume that each input 
word has an end-of-word symbol. Also, many practical lexical analyzers are
implemented using a data structure called a trie. However, those features are not
considered to be common or essential to general lexical analyzers.

A lexical analyzer can be considered as a special type of Moore machine P) 
with all nonfinal states having the empty output (action). 

For notational convenience, we formally define a lexical analyzer to be a
7-tuple

$$A = (Q, \Sigma, \Gamma, \gamma, s, F, \tau)$$

where $(Q, \Sigma, \gamma, s, F)$ is a finite automaton; $\Gamma$ is a set of actions; and
$\tau : F \to \Gamma$ is an action-function. Whether $A$ is deterministic or
nondeterministic depends on whether its underlying finite automaton is a DFA or an NFA.

Let $A = (Q, \Sigma, \Gamma, \gamma, s, F, \tau)$ be a deterministic (nondeterministic)
lexical analyzer. Denote by $L(A)$ the set of all words recognized by the underlying
finite automaton $(Q, \Sigma, \gamma, s, F)$. For $w \in L(A)$, denote by $\hat{\tau}(w)$
the action $\tau(\gamma(s,w))$ (the set of actions $\{\tau(f) \mid f \in F \cap \gamma(s,w)\}$).
We simply write $\tau(w)$ instead of $\hat{\tau}(w)$ if there is no confusion. The above
definition implies that if $w \in L(A)$ and the (an) accepting path for $w$ goes through
several final states, only the action associated to the last final state is activated.

A simple lexical analyzer A„ is shown in Figure 1. 




A: Execute cd    B: Execute csh    C: Execute ls



Fig. 1. A lexical analyzer A„ 
Let $A = (Q, \Sigma, \Gamma, \gamma, s, F, \tau)$ be a lexical analyzer, $\delta$ a
regularity-preserving distance and $\alpha \ge 0$ a radius. A lexical analyzer
$A' = (Q', \Sigma, \Gamma', \gamma', s', F', \tau')$ is called a metric lexical analyzer of
$A$ with respect to $\delta$ and $\alpha$ if

(M1) $L(A') = E(L(A),\delta,\alpha)$, and
(M2) for each $w \in L(A)$, $\tau'(w) = \tau(w)$.

The proof of Theorem 1 gives a general guideline for constructing a metric
lexical analyzer from a given lexical analyzer, an additive quasi-distance, and a
radius. In the following, however, we consider only the shuffle-Hamming distance
$\delta_H$. The idea below can be easily generalized to all additive quasi-distances.

Construction: A Deterministic Metric Lexical Analyzer 

Given: A deterministic lexical analyzer (DLA) $A = (Q, \Sigma, \Gamma, \gamma, s, F, \tau)$
and an integer $k \ge 0$.

Result: A deterministic metric lexical analyzer (DMLA)

$$A' = (Q', \Sigma, \Gamma', \gamma', s', F', \tau')$$

of $A$ with radius $k$.

Construction steps:

i) Construct a nondeterministic lexical analyzer (NLA)

$$A'' = (Q'', \Sigma, \Gamma'', \gamma'', s'', F'', \tau'')$$

such that

   $Q'' = Q \times \{0, \ldots, k\}$,

   $\Gamma'' = \Gamma \cup \{\bar{e} \mid e \in \Gamma\}$,

   $s'' = (s, 0)$,

   $F'' = \{(f, i) \mid f \in F \text{ and } i = 0, \ldots, k\}$,

   $\gamma''$ is given by
     $(q, i) \in \gamma''((p, i), a)$, for $i = 0, \ldots, k$, if $q = \gamma(p, a)$,
     $(q, i+1) \in \gamma''((p, i), b)$, for $i < k$ and $b \in \Sigma$, $b \ne a$, if $q = \gamma(p, a)$,
     $(q, i+1) \in \gamma''((q, i), a)$, for all $a \in \Sigma$ and $(q, i) \in Q''$ where $i < k$,
     $(q, i+1) \in \gamma''((p, i), \lambda)$, for $i < k$ if $q = \gamma(p, a)$, for some $a \in \Sigma$,

   $\tau''$ is given by
     $\tau''((f, 0)) = \tau(f)$, for $f \in F$,
     $\tau''((f, i)) = \bar{e}$, for $f \in F$ and $i = 1, \ldots, k$, if $\tau(f) = e$.

ii) Reduce $A''$ by deleting those states that are not reachable from $s''$ or that
cannot reach a final state.

iii) Construct $A'$ using the subset construction method P] such that $Q'$, $\gamma'$, $s'$,
and $F'$ are defined as in a standard subset construction, except that if a
new state $r \in Q'$ contains both $(q, i)$ and $(q, j)$ for some $q \in Q$ and $i < j$
then delete $(q, j)$ from $r$; $\Gamma' \subseteq \mathcal{P}(\Gamma'')$, and
$\tau'(f') = \tau(f)$ if $(f, 0) \in f'$, for some $(f, 0) \in F''$, or
$\tau'(f') = \{\tau''(f'') \mid f'' \in f' \text{ and } f'' \in F''\}$, otherwise.

iv) Simplify $A'$ by merging all the equivalent states (that have the same actions
if they are final states).
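Steps i) and iii) can be simulated on the fly instead of building the tables explicitly. The Python sketch below tracks, for every state of $A$, the smallest error count with which it can be reached, which mirrors the rule in step iii) that deletes $(q, j)$ when $(q, i)$ with $i < j$ is present. Everything concrete in it is an assumption made for illustration: the state numbering of the example analyzer (words `cd`, `csh`, `ls` with final states 2, 4, 6), the names `DELTA`, `ACTION`, `closure`, `step`, `metric_actions`, and the use of a trailing `~` to stand for a barred action.

```python
# Hypothetical transition table for an analyzer recognizing "cd", "csh", "ls".
DELTA = {0: {"c": 1, "l": 5}, 1: {"d": 2, "s": 3}, 3: {"h": 4}, 5: {"s": 6}}
ACTION = {2: "A", 4: "B", 6: "C"}   # action attached to each final state


def closure(cfg, delta, k):
    """Lambda-moves: skipping an expected letter costs one error."""
    stack = list(cfg.items())
    while stack:
        p, i = stack.pop()
        if i < k:
            for q in delta.get(p, {}).values():
                if cfg.get(q, k + 1) > i + 1:
                    cfg[q] = i + 1
                    stack.append((q, i + 1))
    return cfg


def step(cfg, ch, delta, k):
    """Read one input letter; keep only the smallest error count per state."""
    nxt = {}

    def add(q, i):
        if i <= k and nxt.get(q, k + 1) > i:
            nxt[q] = i

    for p, i in cfg.items():
        add(p, i + 1)                          # ch is an extra letter in the input
        for a, q in delta.get(p, {}).items():
            add(q, i if a == ch else i + 1)    # exact match, or ch replaces a
    return closure(nxt, delta, k)


def metric_actions(word, delta, start, action, k):
    """Actions triggered by word within radius k ('~' marks a barred action)."""
    cfg = closure({start: 0}, delta, k)
    for ch in word:
        cfg = step(cfg, ch, delta, k)
    exact = {action[q] for q, i in cfg.items() if q in action and i == 0}
    if exact:
        return exact
    return {action[q] + "~" for q in cfg if q in action}


for w in ("cd", "cdh", "lsh", "xyz"):
    print(w, metric_actions(w, DELTA, 0, ACTION, 1))
```

With radius 1, the input `cd` triggers the original action `A`, while `cdh` is within one edit step of both `cd` and `csh` and therefore yields the two barred actions.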






C.S. Calude, K. Salomaa, and S. Yu 



Note that the above construction uses two copies of the original set of actions
($\Gamma \cup \{\bar{e} \mid e \in \Gamma\}$) in order to guarantee that the property (M2) holds.

A nondeterministic metric lexical analyzer $A''$ of A„ with radius 1 is
constructed following Step i) and Step ii) and shown in Figure 2, where
$\Gamma'' = \{A, B, C, \bar{A}, \bar{B}, \bar{C}\}$ and $\tau''((2,0)) = A$,
$\tau''((2,1)) = \bar{A}$, $\tau''((4,0)) = B$, $\tau''((4,1)) = \bar{B}$,
$\tau''((6,0)) = C$, $\tau''((6,1)) = \bar{C}$. We use $[\hat{\ }a_1 \cdots a_t]$ to denote all
letters in $\Sigma - \{a_1, \ldots, a_t\}$, and $\bullet$ to denote all letters in $\Sigma$.




The resulting DMLA $A'$ for $A$ is shown in the following table, where t1, ...,
t6 are terminating states which have no transitions:

          0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20  21
  c       1   5  t1   X  10   X   X   X  t2   X   X   X   X  16  t5   X   X  18   X   X  18  18
  d      17   2  t1   X   9  t1  t1  t1  t2   X   X   X   X  16  t5   X  t1  t1  t1   X  t1  t1
  h      21   6  t3  t2   8  t2  t2  t2  t2  t2  t2  t2  t2  t4  t4   X   X   X   X   X  t2   X
  l      13   7  t1   X  10   X   X   X  t2   X   X   X   X  15  t5   X   X  19   X   X  19  19
  s      20   4   3   X  10  12  12  11  t2   X   X   X   X  14  t5  t5  12  11  12  t5  11  11
  other  21   5  t1   X  10   X   X   X  t2   X   X   X   X  t5  t5   X   X   X   X   X   X   X

[The table row listing the actions $\tau'$ of the final states among 0-21 is not
recoverable from the source.]

$\tau'(t1) = \{\bar{A}\}$, $\tau'(t2) = \{\bar{B}\}$, $\tau'(t3) = \{\bar{A}, \bar{B}\}$,
$\tau'(t4) = \{\bar{B}, \bar{C}\}$, $\tau'(t5) = \{\bar{C}\}$, $\tau'(t6) = \{\bar{A}, \bar{C}\}$.
Abstract. The state complexity of basic operations on regular lan-
guages has been studied in 11)11 Ij . Here we focus on finite languages. 
We show that the catenation of two finite languages accepted by an
$m$-state and an $n$-state DFA, respectively, with $m > n$ is accepted by a
DFA of $(m - n + 3)2^{n-2} - 1$ states in the two-letter alphabet case, and
this bound is shown to be reachable. We also show that the tight upper
bound for the number of states of a DFA that accepts the star of an
$n$-state finite language is $2^{n-3} + 2^{n-4}$ in the two-letter alphabet case.
The same bound for reversal is $3 \cdot 2^{p-1} - 1$ when $n$ is even ($n = 2p$) and
$2^{p} - 1$ when $n$ is odd ($n = 2p - 1$). Results for alphabets of an arbitrary
size are also obtained. These upper bounds for finite languages are strictly
lower than the corresponding ones for general regular languages.



1 Introduction 

Many applications of regular languages use essentially finite languages. In jSJ 
mam, the state complexity of basic operations on regular languages has been 
studied. It is interesting and important to know whether those state-complexity 
results still hold for finite languages. For example, $(2m - 1)2^{n-1}$ is the number
of states of a minimal DFA, in the worst case, that accepts the catenation of an 
m-state and an n-state DFA language. Does the catenation of two DFA, each 
accepting a finite language, need the same number of states in the worst case? 
May it be significantly smaller? 

It is known ^ that a minimal DFA that accepts the reversal of an $n$-state
DFA language needs $2^n$ states in the worst case. This fact implies that
Brzozowski's DFA minimization algorithm m, which uses two reversals, is
exponential in time and space in the worst case. However, this algorithm is faster
than other algorithms in many experiments. It is a natural question whether
this algorithm has a polynomial time or space complexity in the case of finite
languages. This question is very much related to the state complexity of the
reversal of finite languages.

* The work reported here has been supported by the Natural Sciences and Engineering
Research Council of Canada Grants OGP0041630 and OGP0147224.

O. Boldt and H. Jürgensen (Eds.): WIA'99, LNCS 2214, pp. 60-^^ 2001.
(c) Springer-Verlag Berlin Heidelberg 2001

State Complexity of Basic Operations on Finite Languages

In this paper, we focus on the above mentioned problems and on the state
complexity of basic operations on finite languages, in general. We show that for
an $n$-state DFA $A$ accepting a finite language $L$, a minimal DFA that accepts
$L^*$ has $2^{n-3} + 2^{n-t-2}$ states in the worst case, where $t \ge 2$ is the number
of final states in $A$ (except the starting state). Note that for $t = 1$, this bound is
simply $n - 1$.

For the catenations of finite languages, we show that a minimal DFA that 
accepts the catenation of two finite languages, which are accepted by an m-state 
DFA and an n-state DFA, respectively, has at most 




states, where k is the size of the alphabet and t is the number of final states in 
the first automaton. Notice that this bound depends very much on t. If t is a 
constant, then this bound is which is polynomial. In particular, 

when $t = 1$, it is $m + n - 2$. In the case of a two-letter alphabet (with an arbitrary
$t$), this bound is $(m - n + 3)2^{n-2} - 1$. We give examples to show that this bound
is reachable.

We also show that $\sum_{i=0}^{t-1} k^i + 2^{n-1-t}$ is an upper bound on the number of
states for a minimal DFA that accepts the reversal of a finite language accepted
by an $n$-state DFA, where $t$ is the smallest integer such that $2^{n-1-t} \le k^t$. This
bound is, in the case of a two-letter alphabet, $3 \cdot 2^{p-1} - 1$ if $n = 2p$ or
$2^p - 1$ if $n = 2p - 1$. We also give examples to show that the latter bounds are
reachable. Unfortunately, these results show that Brzozowski's DFA minimization
algorithm is still exponential in the worst case even for finite languages.
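The two-letter special cases can be checked numerically against the general expression. The sketch below assumes that the general bound reads $\sum_{i=0}^{t-1} k^i + 2^{n-1-t}$ with $t$ the smallest integer satisfying $2^{n-1-t} \le k^t$ (the source text is damaged at this point, so this reading is an assumption); the function name `reversal_bound` is illustrative.

```python
def reversal_bound(n: int, k: int) -> int:
    """Assumed bound: sum_{i<t} k^i + 2^(n-1-t), t minimal with 2^(n-1-t) <= k^t."""
    t = 0
    while 2 ** (n - 1 - t) > k ** t:
        t += 1
    return sum(k ** i for i in range(t)) + 2 ** (n - 1 - t)


# Compare with the closed forms stated for the two-letter alphabet.
for p in range(1, 8):
    even, odd = reversal_bound(2 * p, 2), reversal_bound(2 * p - 1, 2)
    print(p, even, 3 * 2 ** (p - 1) - 1, odd, 2 ** p - 1)
```

For $k = 2$ the minimal $t$ is $p$ when $n = 2p$ and $p - 1$ when $n = 2p - 1$, which is why the sum collapses to the two closed forms.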

We also consider the state complexity of operations on finite languages in 
the case of a one-letter alphabet. 



2 Preliminaries 

Let $T$ be a finite set. Denote by $\#T$ the cardinality of $T$ and by $T^*$ the free
monoid generated by $T$. The empty word, i.e., the neutral element of $T^*$, is
denoted by $\lambda$ and $T^+ = T^* - \{\lambda\}$. For $w \in T^*$, denote by $|w|$ the
length of $w$. We define

$$T^l = \{w \in T^* \mid |w| = l\},\qquad T^{\le l} = \bigcup_{i=0}^{l} T^i,\qquad \text{and}\qquad T^{<l} = \bigcup_{i=0}^{l-1} T^i.$$

If $T = \{t_1, \ldots, t_k\}$ is an ordered set, $k > 0$, the lexicographical order on $T^*$,
denoted $\preceq$, is defined by: $x \preceq y$ iff $x = y$ or $|x| < |y|$ or $|x| = |y|$ and
$x = z t_i v$, $y = z t_j u$, $i < j$, for some $z, u, v \in T^*$ and $1 \le i, j \le k$. We say
that $x$ is a prefix of $y$, denoted $x \le_p y$, if $y = xz$ for some $z \in T^*$. The relation
$\le_p$ is a partial order on $T^*$.
A deterministic finite automaton (DFA) is a quintuple $A = (Q, \Sigma, \delta, q_0, F)$,
where $Q$ is the finite nonempty set of states; $\Sigma$ is the finite nonempty alphabet;
$q_0 \in Q$ is the starting state; $F \subseteq Q$ is the set of final states; and
$\delta : Q \times \Sigma \to Q$ is the transition function. We extend $\delta$ from
$Q \times \Sigma$ to $Q \times \Sigma^*$ by $\hat{\delta}(q, aw) = \hat{\delta}(\delta(q,a), w)$
and $\hat{\delta}(q, \lambda) = q$ for $q \in Q$, $a \in \Sigma$, and $w \in \Sigma^*$. We usually
denote $\hat{\delta}$ by $\delta$ if there is no confusion.

The language recognized by the automaton $A$ is
$L(A) = \{w \in \Sigma^* \mid \delta(q_0, w) \in F\}$. Two automata are equivalent if they
recognize the same language.

For simplicity, in what follows, we assume that $Q = \{0, 1, \ldots, \#Q - 1\}$ and
$q_0 = 0$. We also assume that $\delta$ is a total function, i.e., that the automaton is
complete.

Let $A = (Q, \Sigma, \delta, q_0, F)$ be a DFA. Then

a) a state $s$ is said to be accessible if there exists $w \in \Sigma^*$ such that $\delta(0, w) = s$;

b) a state $s$ is said to be useful if there exists $w \in \Sigma^*$ such that $\delta(s, w) \in F$.

It is clear that for every DFA $A$ there exists an automaton $A'$ such that
$L(A') = L(A)$ and every state of $A'$ is accessible and at most one state is useless
(the sink state). The DFA $A'$ is called a reduced DFA. We will use only reduced
DFA in the following.

A DFA $A = (Q, \Sigma, \delta, q_0, F)$ is said to be minimal if for every other automaton
$A' = (Q', \Sigma, \delta', q_0', F')$ such that $L(A) = L(A')$, we have $\#Q \le \#Q'$.

A minimal DFA has at most one useless state.

Let $L \subseteq \Sigma^*$ and $x, y \in \Sigma^*$. Then $x \equiv_L y$ if for all $z \in \Sigma^*$,
$xz \in L$ iff $yz \in L$. Clearly, $\equiv_L$ is an equivalence relation on $\Sigma^*$. The number
of states in a minimal DFA that accepts $L$ is exactly the number of equivalence classes
of $\equiv_L$ 0. If $L = L(A)$ and $p, q$ are states of the DFA $A = (Q, \Sigma, \delta, q_0, F)$ we
denote also $p \equiv_L q$ (or simply $p \equiv q$) if for all $z \in \Sigma^*$, $\delta(p, z) \in F$ iff
$\delta(q, z) \in F$.

For basic definitions and results in automata theory, the reader may refer to 

EMU. 

3 Star Operation on Finite Languages 

In Ej (also in ^H), it was shown that for any $n$-state (complete) DFA $A$, there
exists a minimal DFA of at most $2^{n-1} + 2^{n-2}$ states that accepts $L(A)^*$. Examples
were also given to show that this bound is reachable. In this section, we show
that in the case that $A$ accepts a finite language rather than an infinite regular
language, the corresponding bound is exactly $2^{n-3} + 2^{n-4}$. The latter is exactly
one-fourth of the former.

Let $A$ be an $n$-state DFA accepting a finite language. If $A$ has only one final
state, it is clear that a minimal DFA accepting $L(A)^*$ needs at most $n - 1$ states.
Note that this is not true in general for an $n$-state DFA accepting an infinite
regular language. It has been shown that the upper bound $2^{n-1} + 2^{n-2}$ can be
reached even for $n$-state DFA with only one final state.

In the following, we consider DFA with at least two final states. 
Theorem 1. Let $A = (Q, \Sigma, \delta, 0, F)$ be a DFA accepting a finite language $L$,
where $0 \notin F$, $\#F = t \ge 2$, $\#Q = n \ge 4$. Then there exists a DFA of at most
$2^{n-3} + 2^{n-t-2}$ states that accepts $L^*$.

Proof. We first construct an NFA $A'$ from $A$ by adding a $\lambda$-transition from each
final state $f \in F$ to 0. Formally, $A' = (Q, \Sigma, \delta', 0, F)$ where
$\delta' : Q \times \Sigma \to 2^Q$ is defined for each $p \in Q$ and $a \in \Sigma$ as follows:

$$\delta'(p,a) = \begin{cases} \{q\} & \text{if } q = \delta(p,a) \text{ and } q \notin F, \\ \{q, 0\} & \text{if } q = \delta(p,a) \text{ and } q \in F. \end{cases}$$

Clearly, $A'$ accepts $L(A)^+$.

Next we construct a DFA $B = (Q_B, \Sigma, \delta_B, 0_B, F_B)$ from $A'$ using the
standard subset-construction method m and, furthermore, make the starting state
of $B$ a final state, which guarantees that $L(B) = L(A)^*$. Then we have
$Q_B \subseteq 2^Q$, $0_B = \{0\}$, $F_B = \{P \in Q_B \mid P \cap F \ne \emptyset\} \cup \{0_B\}$, and
$\delta_B(P, a) = \bigcup_{p \in P} \delta'(p, a)$.

In the following, we assume that, in $A$, $(n - 1)$ is the sink state and $(n - 2)$
is the final state that has transitions only to $(n - 1)$. Without loss of generality,
we also assume that $B$ is a reduced DFA.

Let $P \in Q_B$. Then the following three propositions can be easily proved:

(1) If $P \cap F \neq \emptyset$, then $0 \in P$.

(2) If $(n-1) \in P$, then $P \equiv_{L^*} P - \{n-1\}$.

(3) If $(n-2) \in P$ and $P \cap (F - \{n-2\}) \neq \emptyset$, then $P \equiv_{L^*} P - \{n-2\}$.

Using the above propositions, we can simplify the DFA $B$ by merging all equivalent
states. Let the resulting DFA be $B' = (Q'_B, \Sigma, \delta'_B, 0_B, F'_B)$. So, $Q'_B$ has at
most the following states:

(i) the starting state $0_B = \{0\}$ and the sink state $\{n-1\}$,

(ii) all $P$ such that $P \subseteq (Q - F - \{0, n-1\})$ and $P \neq \emptyset$,

(iii) all $P = \{0\} \cup P' \cup P''$ such that $P' \subseteq (Q - F - \{0, n-1\})$ and $P'' \subseteq F - \{n-2\}$
and $P'' \neq \emptyset$,

(iv) all $P = P' \cup \{0, n-2\}$ where $P' \subseteq (Q - F - \{0, n-1\})$ and $P' \neq \emptyset$.

Note that in (iv) $P' \neq \emptyset$ because $\{0, n-2\}$ is equivalent to $\{0\}$ ($\{0\} \in F'_B$),
which is included in (i).

Now we calculate the number of states in each of the items above: (i) $2$, (ii)
$2^{n-t-2} - 1$, (iii) $2^{n-t-2}(2^{t-1} - 1)$, and (iv) $2^{n-t-2} - 1$.

Hence we have $\#Q'_B \le 2^{n-3} + 2^{n-t-2}$. □

As we have mentioned before, when $t = 1$, we can construct a DFA of at most
$n - 1$ states to accept $L^*$. So, when $t = 2$ we obtain the maximum number of
states for the above formula, i.e., $2^{n-3} + 2^{n-4}$.

Note that if $0 \in F$, then for each $P \in Q_B$ such that $\{0, n-2\} \subseteq P$ we
have $P \equiv_{L^*} (P - \{n-2\})$. Thus, all states of (iv) are included in (iii). Then we
have (i) $2$, (ii) $2^{n-t-1} - 1$, and (iii) $2^{n-t-1} 2^{t-2} - 1$. The total number is $2^{n-t-1} + 2^{n-3}$.
However, if $t \le 2$, then we can construct a DFA of at most $n - 1$ states to accept
$L^*$. So, this formula reaches its maximum when $t = 3$, i.e., $2^{n-3} + 2^{n-4}$, which
is the same as the one in the case $0 \notin F$.
Corollary 1. Let $A = (Q, \Sigma, \delta, 0, F)$ be a DFA accepting a finite language $L$,
where $\#Q = n \ge 4$. Then there exists a DFA of at most $2^{n-3} + 2^{n-4}$ states that
accepts $L^*$.

Theorem 2. There exists a DFA $A = (Q, \Sigma, \delta, 0, F)$ with $\#Q = n \ge 4$ such
that any DFA recognizing $L(A)^*$ has at least $2^{n-3} + 2^{n-4}$ states.

Proof. For an arbitrary integer $n \ge 4$, we define a DFA $A = (Q, \Sigma, \delta, 0, F)$,
where $Q = \{0, 1, \dots, n-1\}$, $\Sigma = \{a, b, c\}$, $F = \{n-3, n-2\}$, and $\delta$ is given by:

$\delta(i, a) = i + 1$, for $0 \le i \le n - 2$,

$\delta(i, b) = i + 1$, for $1 \le i \le n - 2$, and $\delta(0, b) = n - 2$,

$\delta(i, c) = i + 1$, for $0 \le i \le n - 2$ and $n - i$ is odd,

$\delta(i, c) = n - 1$, for $0 \le i \le n - 2$ and $n - i$ is even,

$\delta(n-1, a) = n - 1$, $\delta(n-1, b) = n - 1$, $\delta(n-1, c) = n - 1$.

The DFA A is shown in the figure below in two cases: (a) n is odd and (b) n 
is even. 




Fig. 1. DFA $A$ of $n$ states such that $L(A)^*$ needs $2^{n-3} + 2^{n-4}$ states



We construct a DFA $A' = (Q', \Sigma, \delta', 0', F')$ that accepts $L(A)^*$ following the
two steps described in Theorem 1: (i) construct an NFA by adding a $\lambda$-transition
from each final state to the starting state; (ii) construct a DFA from the resulting
NFA of the previous step using the standard subset-construction algorithm.

In the following it suffices to show that every state specified in Theorem 1 is
(1) reachable from the starting state $\{0\}$ and (2) in a distinct equivalence class
with respect to $L(A)^*$.

We first prove that every state in the proof of Theorem 1 is reachable. For
convenience, we denote the four disjoint subsets of $Q'$ described in (i), (ii), (iii),
and (iv) of Theorem 1 by $Q'_{(i)}$, $Q'_{(ii)}$, $Q'_{(iii)}$, and $Q'_{(iv)}$, respectively. In particular, we
have

$Q'_{(i)} = \{\{0\}, \{n-1\}\}$,

$Q'_{(ii)} = \{P \mid P \subseteq \{1, \dots, n-4\} \text{ and } P \neq \emptyset\}$,

$Q'_{(iii)} = \{P \cup \{0, n-3\} \mid P \subseteq \{1, \dots, n-4\}\}$,

$Q'_{(iv)} = \{P \cup \{0, n-2\} \mid P \subseteq \{1, \dots, n-4\} \text{ and } P \neq \emptyset\}$.

For $Q'_{(i)}$, obviously, the starting state $\{0\}$ and the sink state $\{n-1\}$ are both
reachable. Now we prove the following claim:

Claim. Every state $q' \in Q'_{(iii)}$ is reachable (from the starting state $\{0\}$).

Let $q' \in Q'_{(iii)}$. Then $q' = P \cup \{0, n-3\}$ for some $P \subseteq \{1, \dots, n-4\}$. We prove
the claim by induction on the size of $P$. If $\#P = 0$, then $q' = \{0, n-3\}$. It is clear
that $q' = \delta'(\{0\}, a^{n-3})$. Suppose that every state $q'$ is reachable for $\#P = k$,
$0 \le k < n - 4$. Consider the case when $\#P = k + 1$. Let $q' = \{0, i_0, i_1, \dots, i_k, n-3\}$.
We know that $q'' = \{0, i_2 - i_1, \dots, i_k - i_1, n-3-i_1, n-3\}$ is reachable by the
induction hypothesis. Then it is clear that

= <5'({0, 1, + 1, . . . , 4 - ii + 1, n - 3 - zi + 1, n - 2}, 

= <J'({0, ii - io,i 2 - io, ■ ■ ■ Ck - io,n - 3 - io,n - 2},a"°) 

= {0,zo,*i,i2, • ■ • - 3} = q'. 

Note that if q' = {0,zo,zi,n — 3}, let q" = {0,n — 3 — ii,n — 3}. Then again 
q' = S'{q " If q' = {0,io,n — 3} {k = 0), let q" = {0,n — 3} and 
q' = S'{q” Therefore, we have proved the claim. 

Note that the claim directly implies that any state $P_2 \in Q'_{(iv)}$ is reachable since
for any $P_2 = \{0, i_1, \dots, i_k, n-2\}$, where $0 < i_1 < \dots < i_k < n - 3$, we have
$P_3 = \{0, i_1 - 1, \dots, i_k - 1, n-3\} \in Q'_{(iii)}$ such that $\delta'(P_3, b) = P_2$. Note that it
is possible that $i_1 - 1 = 0$.

It is also clear that every state $P \in Q'_{(ii)}$ is reachable since for any such
state $P = \{i_1, \dots, i_k\}$, where $0 < i_1 < \dots < i_k < n - 3$, we have
$P' = \{0, i_2 - i_1, \dots, i_k - i_1, n-2\}$ such that $\delta'(P', a^{i_1}) = P$. So, we have proved that every
state specified in Theorem 1 is reachable from $\{0\}$.

Now, we prove that every state above is in a distinct equivalence class of
$\equiv_{L^*}$.

It is clear that if two states $p$ and $q$ are from different sets of $Q'_{(i)}$, $Q'_{(ii)}$,
$Q'_{(iii)}$, $Q'_{(iv)}$, then $p \not\equiv q$ (with respect to $L^*$). It suffices in the remaining to
prove that if there exists $i \in \{1, \dots, n-4\}$ such that $i \in p - q$, then $p \not\equiv q$. If $n - i$
is odd, then both 5'(p, and S'{p,ca"'~^~^) are final, but <5'(g, 

and S'{q, 00 "“*“^) cannot be final at the same time. If zz — z is even and z < zz — 4, 
then both S'(p,aca^~''~^) and (5'(p, aca”“*“^) are final, but S' {q, aca'^~'‘~^) and 
S'{q, cannot be both final. If z = zz — 4, then S'{p, a) G F' but S'{q, a) ^ 

P'. Therefore, p^ q. □ 

We do not yet have an example for the two-letter alphabet case. It is still 
open whether there exists a lower upper bound for the two-letter alphabet case. 
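
The construction used in Theorems 1 and 2 can be checked mechanically for small $n$. The following Python sketch is not part of the original paper: the function names are illustrative, and the transition function is the one read from the (partly damaged) definition in the proof of Theorem 2, so it should be treated as an assumption. It builds the witness DFA, determinizes the star automaton by the subset construction, and counts the equivalence classes by partition refinement.

```python
def witness_dfa(n):
    """Witness DFA over {a, b, c} as read from the proof of Theorem 2.

    States are 0 .. n-1, state n-1 is the sink, F = {n-3, n-2}.
    """
    delta = {}
    for i in range(n - 1):                      # states 0 .. n-2
        delta[i, "a"] = i + 1
        delta[i, "b"] = n - 2 if i == 0 else i + 1
        delta[i, "c"] = i + 1 if (n - i) % 2 == 1 else n - 1
    for x in "abc":
        delta[n - 1, x] = n - 1                 # the sink loops on every letter
    return delta, {n - 3, n - 2}


def star_dfa(delta, finals, alphabet="abc"):
    """Subset construction for L(A)*: reaching a final state also adds state 0."""
    start = frozenset({0})
    trans, todo, seen = {}, [start], {start}
    while todo:
        subset = todo.pop()
        for x in alphabet:
            target = set()
            for p in subset:
                q = delta[p, x]
                target.add(q)
                if q in finals:                 # lambda-transition back to 0
                    target.add(0)
            target = frozenset(target)
            trans[subset, x] = target
            if target not in seen:
                seen.add(target)
                todo.append(target)
    accepting = {s for s in seen if s & finals} | {start}
    return seen, trans, accepting


def count_classes(states, trans, accepting, alphabet="abc"):
    """Number of states of the minimal DFA, by Moore-style partition refinement."""
    block = {s: int(s in accepting) for s in states}
    while True:
        ids = {}
        new_block = {}
        for s in states:
            signature = (block[s],) + tuple(block[trans[s, x]] for x in alphabet)
            new_block[s] = ids.setdefault(signature, len(ids))
        if len(ids) == len(set(block.values())):
            return len(ids)
        block = new_block


def star_state_complexity(n):
    delta, finals = witness_dfa(n)
    return count_classes(*star_dfa(delta, finals))


for n in range(4, 8):
    print(n, star_state_complexity(n), 2 ** (n - 3) + 2 ** (n - 4))
```

Each printed line shows $n$, the computed number of states, and the bound $2^{n-3} + 2^{n-4}$ for comparison.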






4 Catenation of Finite Languages

We now consider the state complexity of the catenation of two finite languages. 

Without loss of generality, we assume that all the DFA we are considering
are reduced and ordered. A DFA $A = (Q, \Sigma, \delta, 0, F)$ with $Q = \{0, 1, \dots, n\}$ is
called an ordered DFA if, for any $p, q \in Q$, the condition $\delta(p, a) = q$ implies that
$p \le q$.

For convenience, we introduce the following notation:

$$\binom{n}{\le i} = \sum_{j=0}^{i} \binom{n}{j}.$$



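Theorem 3. Let $A_i = (Q_i, \Sigma, \delta_i, 0, F_i)$, $i = 1, 2$, be two DFA accepting finite
languages $L_i$, $i = 1, 2$, respectively, and $\#Q_1 = m$, $\#Q_2 = n$, $\#\Sigma = k$, and
$\#F_1 = t$. There exists a DFA $A = (Q, \Sigma, \delta, s, F)$ such that $L(A) = L(A_1)L(A_2)$
and

$$\#Q \le \sum_{i=0}^{m-2} \min\left\{ k^i, \binom{n-2}{\le i}, \binom{n-2}{\le t-1} \right\} + \binom{n-2}{\le t}. \qquad (*)$$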



Proof. The DFA $A$ is constructed in two steps. First, an NFA $A'$ is constructed
from $A_1$ and $A_2$ by adding a $\lambda$-transition from each final state in $F_1$ to the
starting state $0$ of $A_2$. Then, we construct a DFA $A$ from the NFA $A'$ by the
standard subset construction. Again, we assume that $A$ is reduced and ordered.

It is clear that we can view each $q \in Q$ as a pair $(q_1, P_2)$, where $q_1 \in Q_1$
and $P_2 \subseteq Q_2$. The starting state of $A$ is $s = (0, \emptyset)$ if $0 \notin F_1$ and $s = (0, \{0\})$ if
$0 \in F_1$. Let us consider all states $q \in Q$ such that $q = (i, P)$ for a particular state
$i \in Q_1 - \{m-1\}$ and some set $P \subseteq Q_2$. Since $A_1$ is ordered and acyclic, the
number of such states in $Q$ is restricted by the following three bounds: (1) $k^i$,
(2) $\binom{n-2}{\le i}$, and (3) $\binom{n-2}{\le t-1}$. We explain these bounds below informally.



We have (1) as a bound since all states of the form q = {i, P) are at a level 
< i, which have at most predecessors. By saying that a state p is at level i 
we mean that the length of the longest path from the starting state to $p$ is $i$.

We now consider (2). Notice that if $q, q' \in Q$ are such that $\delta(q, a) = q'$,
$q = (q_1, P_2)$ and $q' = (q'_1, P'_2)$, then $\delta_1(q_1, a) = q'_1$ and $P'_2 = \{\delta_2(p, a) \mid p \in P_2\}$ if
$q'_1 \notin F_1$ and $P'_2 = \{0\} \cup \{\delta_2(p, a) \mid p \in P_2\}$ if $q'_1 \in F_1$. So, $\#P'_2 > \#P_2$ is possible
only when $q'_1 \in F_1$. Therefore, for $q = (i, P)$, $\#P \le i$ if $i \notin F_1$ and $\#P \le i + 1$
if $i \in F_1$. In both cases, the maximum number of distinct sets $P$ is $\binom{n-2}{\le i}$.

The number $n - 2$ comes from the exclusion of the sink state $n - 1$ and starting
state $0$ of $A_2$. Note that, for a fixed $i$, either $0 \in P$ for all $(i, P) \in Q$ or $0$ is not
in any set $P$ such that $(i, P) \in Q$.

(3) is a bound since for each state $i \in Q_1 - \{m-1\}$, there are at most $t - 1$
final states on the path from the starting state to $i$ (not including $i$).

For the second term of (*), it suffices to explain that for each $(m-1, P)$,
$P \subseteq Q_2$, $\#P$ is bounded by the total number of final states in $F_1$. □
Corollary 2. Let $A_i = (Q_i, \Sigma, \delta_i, 0, F_i)$, $i = 1, 2$, be two DFA accepting finite
languages $L_i$, $i = 1, 2$, respectively, and $\#Q_1 = m$, $\#Q_2 = n$, and $\#F_1 = t$,
where $t > 0$ is a constant. Then there exists a DFA $A = (Q, \Sigma, \delta, s, F)$ of
$O(mn^{t-1} + n^t)$ states such that $L(A) = L(A_1)L(A_2)$.

We can simplify the formula in Theorem 3 for the case when $k = 2$, $m + 1 \ge n > 2$.

Corollary 3. For $k = 2$ and $m + 1 \ge n > 2$, the upper bound given in Theorem 3
is

$(m - n + 3)2^{n-2} - 1$.

We omit the details of the mathematical calculation. 
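
The omitted calculation can at least be checked numerically. The sketch below is not from the paper: it assumes that the bound (*) of Theorem 3 has the form $\sum_{i=0}^{m-2} \min\{k^i, \binom{n-2}{\le i}, \binom{n-2}{\le t-1}\} + \binom{n-2}{\le t}$, where $\binom{n}{\le i}$ denotes $\sum_{j=0}^{i} \binom{n}{j}$, and it takes $t = m - 1$ (every non-sink state of $A_1$ final) as the worst case. Both the exact shape of (*) and the choice of $t$ are assumptions made for illustration, and the function names are hypothetical.

```python
from math import comb


def binom_le(n, i):
    """Number of subsets of an n-element set that have at most i elements."""
    return sum(comb(n, j) for j in range(0, min(i, n) + 1))


def catenation_bound(m, n, k, t):
    """Assumed form of the bound (*) of Theorem 3."""
    head = sum(
        min(k ** i, binom_le(n - 2, i), binom_le(n - 2, t - 1))
        for i in range(m - 1)               # i = 0 .. m-2
    )
    return head + binom_le(n - 2, t)


# Compare with the closed form of Corollary 3 for k = 2 and m + 1 >= n > 2.
for n in range(3, 7):
    for m in range(n - 1, n + 3):
        closed_form = (m - n + 3) * 2 ** (n - 2) - 1
        print(m, n, catenation_bound(m, n, 2, m - 1), closed_form)
```

Under these assumptions the two printed values agree on every line, because $\min\{2^i, \binom{n-2}{\le i}\}$ equals $2^i$ for $i \le n - 2$ and $2^{n-2}$ afterwards.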

Theorem 4. The upper bound given in Corollary 3 is reachable.

Proof. Let $A_1 = (Q_1, \Sigma, \delta_1, 0, F_1)$ and $A_2 = (Q_2, \Sigma, \delta_2, 0, F_2)$, with $\Sigma = \{a, b\}$,
$Q_1 = \{0, 1, \dots, m-1\}$, $Q_2 = \{0, 1, \dots, n-1\}$, and $m + 1 \ge n > 2$. $A_1$ and $A_2$
are shown below.

[Figure: the DFA $A_1$ and $A_2$]

Let $L = L(A_1)L(A_2)$. We show that there are at least $(m - n + 3)2^{n-2} - 1$
equivalence classes of the relation $\equiv_L$ over $\Sigma^*$.

Consider all words $w \in \Sigma^*$ such that $|w| \le m - 2$.

If wi,W 2 G E* , |uii|,|u> 2 | < TO — 2, and |wi| < |w 2 |, then w\^i^W 2 since 
e L{A) but ^ 

Let |wi| = |r<; 2 | but w\ W 2 and wi and u >2 differ at the tth position from 
the right, i < n — 2. We assume that wi contains an a and W 2 contains a 6 at 
that position. Then wi^^W 2 since ^ L but G L. 

So, for each k, 0 < k < n — 2, words of length k belong to 2^ distinct 
equivalence classes of =l. For each k, n — 2<k<m — 2, words of length k 
belong to at least 2"“^ distinct equivalence classes. 

Therefore there are at least

$$1 + 2 + \dots + 2^k + \dots + 2^{n-2} + \underbrace{2^{n-2} + \dots + 2^{n-2}}_{m-2-(n-2)+1 \text{ terms}} = 2^{n-1} - 1 + (m-n+1)2^{n-2} = (m-n+3)2^{n-2} - 1$$

equivalence classes of $\equiv_L$.

□
5 Reversal of Finite Languages

Next we develop a tight upper bound for the state complexity of the reversal of
a finite language.

Theorem 5. Let $A = (Q, \Sigma, \delta, 0, F)$ be a DFA accepting a finite language $L$,
where $\#Q = n \ge 3$ and $\#\Sigma = k \ge 2$. Let $t$ be the smallest integer such that
$2^{n-1-t} \le k^t$. Then there exists a DFA $B = (Q_B, \Sigma, \delta_B, 0, F_B)$, with
$\#Q_B \le \sum_{i=0}^{t-1} k^i + 2^{n-1-t}$, that accepts $L^R$, i.e., the reversal of $L$.

Proof. $B$ is constructed by first reversing all the transitions of $A$ and then
determinizing the resulting NFA by the standard subset construction. Then each
state in $Q_B$ is a subset of $Q$. Recall that the level of a state in a finite automaton
is the length of the shortest path from the starting state to this state. It
is clear that the number of states at each level $i$ of $B$ is bounded by $k^i$. It is
also not difficult to see that this number is bounded also by $2^{n-1-i}$ since they
are subsets of at most $n - 1 - i$ states of $A$. Let $l$ be the length of the longest
word(s) in $L$ (or $L^R$). The latter bound holds because for each $i$, $0 \le i \le l$, there
exists at least one state of $A$ that can be in a state of $B$ of level $i$ but not in
any state of a higher level. Then the number of states at each level $i$ is bounded
by $\min\{k^i, 2^{n-1-i}\}$. Since $t$ is the smallest integer such that $2^{n-1-t} \le k^t$, we
have $\#Q_B \le \sum_{i=0}^{t-1} k^i + 2^{n-1-t}$. Note that $2^{n-1-t}$ is the number of all remaining
subsets of $Q$ after the first $t - 1$ levels. □

Corollary 4. Let $|\Sigma| = 2$ and $A$ be a DFA of $n \ge 3$ states, accepting a finite
language $L \subseteq \Sigma^*$. Then there exists a DFA $B$ that accepts $L^R$ such that $B$ has
at most $3 \cdot 2^{p-1} - 1$ states if $n = 2p$ or $2^p - 1$ states if $n = 2p - 1$.

Proof. Since $k = 2$, we have $2^{n-1-t} \le 2^t$, i.e. $n - 1 \le 2t$. If $n = 2p$ then $t = p$
and $n - 1 - t = 2p - 1 - p = p - 1$. We have

$$\sum_{i=0}^{t-1} 2^i + 2^{n-1-t} = 2^p - 1 + 2^{p-1} = 3 \cdot 2^{p-1} - 1.$$

If $n = 2p - 1$ then $t = p - 1$ and $n - 1 - t = 2p - 1 - 1 - p + 1 = p - 1$. We have

$$\sum_{i=0}^{t-1} 2^i + 2^{n-1-t} = 2^{p-1} - 1 + 2^{p-1} = 2^p - 1.$$

□
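
The bound of Theorem 5 and the closed forms of Corollary 4 are easy to compare numerically. The following Python sketch is an illustration added here, not part of the paper; `reversal_bound` is a hypothetical helper that searches for the smallest $t$ with $2^{n-1-t} \le k^t$ and then evaluates $\sum_{i=0}^{t-1} k^i + 2^{n-1-t}$.

```python
def reversal_bound(n, k):
    """Upper bound of Theorem 5 for an n-state DFA over a k-letter alphabet."""
    t = 0
    while 2 ** (n - 1 - t) > k ** t:    # smallest t with 2^(n-1-t) <= k^t
        t += 1
    return sum(k ** i for i in range(t)) + 2 ** (n - 1 - t)


# For k = 2 the bound should match Corollary 4 in both parity cases.
for p in range(2, 7):
    even_case = (reversal_bound(2 * p, 2), 3 * 2 ** (p - 1) - 1)
    odd_case = (reversal_bound(2 * p - 1, 2), 2 ** p - 1)
    print(p, even_case, odd_case)
```

For example, $n = 6$ gives $t = 3$ and $1 + 2 + 4 + 2^2 = 11 = 3 \cdot 2^2 - 1$.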



Theorem 6. The bounds given by Corollary 4 are reachable.

Proof. If $n = 2p$ for some integer $p > 1$, consider the DFA $A = (Q, \Sigma, \delta, 0, F)$ in
the above figure.

Clearly, the reversal of $A$ is equivalent to the catenation of $A_1$ and $A_2$ given
in Theorem 4 with $m = n = p + 1$. Then any DFA accepting $L(A)^R$ has at least
$2^{p-1} + 2^p - 1 = 3 \cdot 2^{p-1} - 1$ states.

If $n = 2p - 1$ for some integer $p > 1$, then look at the DFA $A' =
(Q', \Sigma, \delta', 0, F')$ below:

[Figure: the DFA $A'$ for $n = 2p - 1$]

The reversal of $A'$ is equivalent to the catenation of $A_1$ and $A_2$ given in
Theorem 4 with $m = p$ and $n = p + 1$. Thus, the number of states is at least
$2^p - 1$. □



6 Operations on Finite Languages over a One-Letter
Alphabet

We consider the case when $\#\Sigma = 1$. Without loss of generality, we assume that
$\Sigma = \{a\}$.

Notice that if $A = (Q, \{a\}, 0, \delta, F)$ is a minimal DFA that accepts words of
length at most $l$, then $\#Q = l + 1$.

Theorem 7. Let $A_i = (Q_i, \{a\}, 0, \delta_i, F_i)$, $i = 1, 2$, be two minimal DFA, with
$\#L(A_i) < \infty$, $\#Q_1 = m$, and $\#Q_2 = n$. Let $A = (Q, \{a\}, 0, \delta, F)$, $\#Q = k$, be a
minimal DFA. Then we have the following:

a) If $L(A) = L(A_1) \cup L(A_2)$, then $k = \max\{m, n\}$,

b) If $L(A) = L(A_1) \cap L(A_2)$, then $k \le \min\{m, n\}$,

c) If $L(A) = L(A_1) - L(A_2)$, then $k \le m$,

d) If $L(A) = L(A_1) \Delta L(A_2)$, then $k \le \max\{m, n\}$,

e) If $L(A) = \{a\}^* - L(A_1)$, then $k = m$,

f) If $L(A) = L(A_1)L(A_2)$, then $k = m + n - 1$,

g) If $L(A) = L(A_1)^*$, then $k \le m^2 - 7m + 13$ for $m > 4$ and $m = 3$, and $k \le 2$
otherwise,

h) If $L(A) = a \backslash L(A_1)$, then $k = m - 1$,

i) If $L(A) = (L(A_1))^R$, then $k = m$.

Proof. For a)-f) and h) the proof is obvious. For g), we give an informal proof
in the following. It is clear that the length of the longest word accepted by $A_1$ is
$m - 2$. We consider the following three cases: (1) $A_1$ has one final state; (2) $A_1$
has two final states; or (3) $A_1$ has three or more final states. If (1), then $A$ has
$m - 1$ states. For (2), we need a lemma (Lemma 5.1 (iii)) from [?] which says
that for two positive integers $i$ and $j$ with $(i, j) = 1$, the largest integer that cannot
be presented as $ci + dj$ for any integers $c, d \ge 0$ is $i \cdot j - (i + j)$. Let $i = m - 2$
and $j = m - 3$, i.e., $F_1 = \{m-2, m-3\}$. Then the length of the longest word
that is not in $L(A)$ is

$(m - 2)(m - 3) - (2m - 5) = m^2 - 7m + 11$.

Then $A$ has exactly $m^2 - 7m + 13$ states. If (3), it is easy to see that $A$ cannot
have more than $m^2 - 7m + 13$ states. □
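
The arithmetic in case (2) can be confirmed by brute force. The sketch below is an added illustration, not part of the paper; `largest_gap` is a hypothetical helper that marks every length of the form $ci + dj$ up to $i \cdot j$ and returns the largest length left unmarked, which is then compared with $m^2 - 7m + 11$ for $i = m - 2$ and $j = m - 3$.

```python
def largest_gap(i, j):
    """Largest non-negative integer not of the form c*i + d*j with c, d >= 0.

    Assumes i and j are coprime and both at least 2, so that the answer
    exists and is smaller than i * j.
    """
    limit = i * j
    reachable = [False] * (limit + 1)
    reachable[0] = True
    for v in range(1, limit + 1):
        reachable[v] = (v >= i and reachable[v - i]) or (v >= j and reachable[v - j])
    return max(v for v in range(limit + 1) if not reachable[v])


for m in range(5, 10):
    gap = largest_gap(m - 2, m - 3)
    # Lengths 0 .. gap are told apart and all longer words are accepted, so a
    # unary DFA for the starred language needs gap + 2 states.
    print(m, gap, m * m - 7 * m + 11, gap + 2)
```

For $m = 5$ the two final states give word lengths 2 and 3; the only length that cannot be composed is 1, which matches $25 - 35 + 11 = 1$.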

Remark 1. All the above bounds are the lowest upper bounds in the worst case.
If the initial DFA $A_1$ and $A_2$ are not minimal, all the above equalities become
inequalities.
Abstract. Words not present in the dictionary are almost always found
in unrestricted texts. However, there is a need to obtain their likely base
forms (in lemmatization), or morphological categories (in tagging), or
both. Some of them find their way into dictionaries, and it would be
useful to predict what their entries should look like. Humans can perform
those tasks using endings of words (sometimes prefixes and infixes
as well), and so can computers. Previous approaches used manually
constructed lists of endings and associated information. Brill proposed
transformation-based learning from corpora, and Mikheev used Brill's
approach on data for a morphological lexicon. However, both Brill's
algorithm and Mikheev's algorithm, which is derived from Brill's, lack
speed, both in the rule acquisition phase and in the rule application
phase. Their algorithms handle only the case of tagging, although an
extension to other tasks seems possible. We propose a very fast finite-state
method that handles all of the tasks described above, and that achieves
similar quality of guessing.



1 Introduction 

A morphological analysis of words in a text is needed in many applications. 
It constitutes a prerequisite for natural language parsing and all applications 
that use it, and it is also useful in document retrieval. Such analysis is usually 
lexicon-based, i.e. it requires a morphological lexicon. 

Unfortunately, real-world texts contain correct words that cannot be found
in a lexicon. It seems impossible to record all words of a living language in a
lexicon, as a lexicon is static in nature, and a language is a living thing: new
words are coined continually. Another reason for finding words not present in
the lexicon is Zipf's law [1]. Zipf's law states that the rank of an element
multiplied by its frequency of occurrence is approximately constant. E.g. in the
Brown corpus, two percent of different words¹ account for sixty nine percent of the
text. About seventy five percent of different words occur five or fewer times in the
corpus. Fifty eight percent of different words occur two or fewer times, and forty four
percent only occur once. The consequence of Zipf's law is that by doubling
the number of words in the lexicon, one gains only a few percent of the coverage

¹ Eric Brill uses the terms word type and token.



O. Boldt and H. Jiirgensen (Eds.): WIA’99, LNCS 2214, pp. 71-^21 2001. 
of an arbitrary unrestricted text. Therefore, increasing the size of the lexicon is
a very costly effort yielding minute results. 

New words are also constructed by derivation or compounding. The number 
of potential words formed in that way is huge. Therefore, it is not practical to 
store all such derivatives and compounds in the lexicon. In many cases there 
may be many ways to form a new word, and it is not possible to predict which 
one would be chosen by humans as correct. 

Additionally, texts may contain incorrect words. For purposes such as spelling
correction, the morphological categories of a misspelled word may help reduce
the list of possible corrections. If the misspelling does not affect the word’s 
flectional ending (and its prefix, if present), these categories may still be easily 
obtainable from the corrupted version. 

Some previously unseen words eventually make their way into the (morphological)
lexicon. In order to do that, we need to give them morphological
descriptions. The task of writing them by hand is laborious; it would be much 
easier to choose a description from a list of feasible descriptions. 

Humans can perform all those tasks quite well. The reason they can do that
is that they can associate the information they want to extract with endings of
words (sometimes also with their prefixes and infixes). So can computers. We
propose a fast finite-state technique to accomplish that.

2 Related Work 

In the past, various hand-crafted heuristics have been used for the purpose of 
morphological analysis of unknown words. Later, they have been supplemented 
by statistical techniques (e.g. [Zj). However, although the probabilities of different 
endings leading to their corresponding categories were calculated, the endings 
themselves were chosen manually. A revolutionary approach was proposed by 
Eric Brill (|P, 0). The endings, as well as prefixes, are found by the program. 
Unknown words are first tagged by a naive initial-state annotator that assigns
the tags proper noun or common noun on the basis of their capitalization. Then
five types of transformations are applied:

Change the tag of an unknown word (from X) to Y if: 

1. Deleting the prefix (suffix) x, |x| ≤ 4, results in a word (x is any string of
length 1 to 4).

2. The first (last) (1,2,3,4) characters of the word are x.

3. Adding the character string x as a prefix (suffix) results in a word (|x| ≤ 4).

4. Word w ever appears immediately to the left (right) of the word.

5. Character z appears in the word.

The result is compared with a tagged corpus. The best-scoring rule is chosen, 
and applied to the corpus so that it becomes new input data. The learning 
stops when no rule can increase the score. The transformation type 4 takes into 
account the context of the unknown word. When the morphological analysis is 
separated from a contextual tagger, as is the case with our approach, the tagger
must find those rules itself. The transformation types 1 and 3 check whether
adding characters to or deleting characters from a word results in another word. In our
algorithm those transformations are not present, and neither is transformation type
5, which can be treated, if necessary, as a supplementary heuristic. The rule
schemata presented above do not account for infixes, which can be treated with
our method.

Andrei Mikheev ([3]) applied Brill's transformations to the data for a (pre-existing)
morphological lexicon. He uses the following template:

G = x:{b,e} [-S +M ?I-class → R-class],

where:

— x indicates whether the rule is applied to the beginning or end of a word
and has two possible values: b - beginning, e - end;

— S is the affix to be segmented; it is deleted (-) from the beginning or end of
an unknown word according to the value of x;

— M is the mutative segment (possibly empty), which should be added (+) to
the result string after segmentation;

— I-class is the required POS-class (set of one or more POS-tags) of the stem;
the result string after the -S and +M operations should be checked (?) in
the lexicon for having this particular POS-class; if I-class is set to be "void"
no checking is required;

— R-class is the POS-class to assign (→) to the unknown word if all the above
operations (-S +M ?I) have been successful.

Compared to Brill’s algorithm, Mikheev checks also (optionally) the mor- 
phological class (a set of categories) of the resulting word in transformation 
templates 1 and 3. Also, his algorithm returns all categories of an unknown 
word, not only the most probable one. 

Although Mikheev is not aware of that fact0, the rules learnt by Brill’s algo- 
rithm can be transformed into a finite-state automaton. However, that process 
is complex and time-consuming. The learning process takes time as well. Our 
algorithm produces the FSA directly from data exploring the links that occur 
naturally in the format of data we use. In Brill’s approach, copied by Mikheev, 
the length of suffixes and prefixes is a constant. Increasing it means much more 
computation. In our method, the suffixes are discovered naturally, so there is no 
need to limit their lengths. 

It should be stressed that both Brill and Mikheev guess only categories (tags), 
and not base forms or morphological descriptions. Jan Tokarski (see jOj, in Polish) 
prepared data for guessing not only categories, but the base forms and morpho- 
logical descriptions as well. However, he did that enormous work by hand. The 
result is a book of a few hundred pages. It has been used in the creation of several
morphological dictionaries of Polish. In at least one implementation, the contents
of the book have been converted into a finite-state automaton.



^ [SI does not appear among references in 0 
3 Finite-State Approach

Our approach is based on the observation that it is possible to associate endings 
with required information by inverting inflected words, and sticking the required 
information (an annotation) at the end. By performing that operation on all 
inflected forms in the lexicon, we get a finite set of strings. Therefore, it is possible 
to convert it into a minimal, deterministic, acyclic, finite-state automaton that 
we call a guessing automaton. To find the appropriate annotation for an unknown word,
we need to invert the word, and then look for it in the guessing automaton.
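
The lookup idea can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not the authors' implementation: it uses a plain trie over inverted word forms instead of a minimal automaton, it stores at every node the annotations of all lexicon words that share the ending read so far, and the names `Node`, `build_guesser` and `guess` as well as the toy lexicon are hypothetical.

```python
class Node:
    """Trie node: one per ending (read right to left) seen in the lexicon."""

    def __init__(self):
        self.children = {}          # next character -> Node
        self.annotations = set()    # annotations of all words with this ending


def build_guesser(lexicon):
    """Build a trie of inverted inflected forms from (form, annotation) pairs."""
    root = Node()
    for form, annotation in lexicon:
        node = root
        for ch in reversed(form):
            node = node.children.setdefault(ch, Node())
            node.annotations.add(annotation)
    return root


def guess(root, word):
    """Follow the inverted word as far as possible; return what is found there."""
    node = root
    for ch in reversed(word):
        if ch not in node.children:
            break
        node = node.children[ch]
    return set(node.annotations)


toy_lexicon = [("bomba", "Verb"), ("tomba", "Verb"), ("table", "Noun")]
guesser = build_guesser(toy_lexicon)
print(guess(guesser, "zomba"))   # shares the ending "omba" with two verbs
print(guess(guesser, "cable"))   # shares the ending "able" with one noun
```

A word that shares no ending with any lexicon entry yields an empty set in this sketch.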



3.1 Data 

The exact format of the annotation depends on the information we want to put 
into it. In a general case, we need a special symbol that we call an annotation 
separator to separate annotations from the inflected forms. For reasons that we 
explain later, we also need another special symbol to mark the end of the inverted 
inflected form; we put that symbol in front of the annotation separator. If the 
annotation should consist of morphological categories of the inflected form, we 
simply put them after the annotation separator. Example: 

abmob_+Verb [mode=ind tense=past num=sg person=3] 

If the annotation is to be the base form, a little bit of coding is necessary 
in order to avoid inflating the automaton. We assume that it is only the ending 
that is different in the base form as compared with the inflected form. Therefore, 
we can replace the full base form with a code that says how many characters 
are to be deleted from the end of the inflected form, and a string consisting of 
characters that are to be appended to obtain the base form. When no characters 
are to be deleted, we put ’A’ there, one character - ’B’, etc. Example: 
abmob_+Ber 

It is possible to put both the base form, and the categories in annotations. 
Example: 

abmob_+Ber+Verb [mode=ind tense=past num=sg person=3] 
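
The three annotation formats above can be produced mechanically. The following Python sketch is an added illustration: the function names are hypothetical, `_` and `+` are taken from the examples as the end marker and the annotation separator, and the deletion count is encoded as a single letter (`A` for 0, `B` for 1, and so on), which limits this sketch to deleting at most 25 characters.

```python
import os

END_MARK = "_"      # marks the end of the inverted inflected form
SEPARATOR = "+"     # annotation separator


def encode_entry(inflected, base=None, categories=None):
    """Build one string for the guessing automaton from a lexicon entry."""
    entry = inflected[::-1] + END_MARK
    if base is not None:
        # Keep the longest common prefix, delete the rest of the inflected
        # form, and append what is left of the base form.
        keep = len(os.path.commonprefix([inflected, base]))
        delete_code = chr(ord("A") + len(inflected) - keep)
        entry += SEPARATOR + delete_code + base[keep:]
    if categories is not None:
        entry += SEPARATOR + categories
    return entry


def decode_base(inflected, code):
    """Recover the base form from an inflected form and a code such as 'Ber'."""
    to_delete = ord(code[0]) - ord("A")
    return inflected[: len(inflected) - to_delete] + code[1:]


tags = "Verb [mode=ind tense=past num=sg person=3]"
print(encode_entry("bomba", categories=tags))
print(encode_entry("bomba", base="bomber"))
print(encode_entry("bomba", base="bomber", categories=tags))
print(decode_base("bomba", "Ber"))
```

The three encoded strings reproduce the examples given above, and decoding `Ber` against `bomba` gives back `bomber`.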

Annotations for morphological data acquisition are more complicated, as the 
base form may be different from the lexical form, and the lexical form may contain
arch-phonemes. Also, they depend on the particular morphology program
we use. Therefore, we will not give any examples of that here. The annotation 
formats we present here are very simple; however, by modifying the annotation 
format we can successfully handle prefixes and infixes as well. 

3.2 Pruning 

As word endings normally decide what annotation should be associated with 
a given word, the automaton has a particular structure. For a given word, the 
first few states have many outgoing transitions. Then, there is a chain of states 
linked with each other with single transitions. Passing to annotations, a more 
complicated transition network appears again. As for any state in the central 
part, there is only one way leading from the state to the appropriate annotations, 
it represents no useful information for our purpose. So all states from that part
can be pruned, along with the corresponding transitions. Pruning explains the
need for the special symbol marking the end of the inverted inflected form (the
beginning of the inflected form). We need it because we no longer have full words
in the automaton: sometimes entire shorter words may constitute the end part
of longer words, and different words may have different annotations, so we need
to distinguish them.

Pruning is governed by the following rules: 

Rule R 1. The pruning process does not apply to transitions belonging to an- 
notations. 

This rule should be obvious, because annotations are what we want to ob- 
tain in the recognition phase. The transitions that are pruned belong to the 
inflected forms, and more precisely: to their beginnings that do not influence the 
annotations. 

Rule R 2. A transition can be removed only if the pruning process has already 
visited all transitions that can be reached through the target state, except the 
transitions representing annotations. In other words, the states are visited (and 
the outgoing transitions pruned if possible) in the postorder method. 

This means that we traverse the automaton recursively in depth, cutting 
unneeded transitions on the way back. 

Rule R 3. A transition cannot be removed if the target state has a transition 
that does not belong to annotations, but cannot be removed. 

This means that the target state has transitions that distinguish between 
different annotations, i.e. they lead to different sets of annotations. We do not 
want to lose that distinction. 

Rule R 4. A state can always be replaced with an equivalent one (i.e. a state 
with the same transitions leading to the same states). 

The automaton should be kept minimal. 

Rule R 5. A state with all transitions leading to one state (the target state) 
can be removed with all transitions, and transitions that point to it should be 
replaced by transitions pointing to the target state. 

This is the rule that actually cuts transitions. Note that it also applies to 
states with only one transition.

The consequence of the rules R2 and R5 is that we cannot remove states 
(and transitions that lead to them) if they have transitions leading to different 
sets of annotations. This is because the rule R2 ensures that all transitions of 
a visited state lead to states that either cannot be removed, or that have an 
outgoing transition labeled with the annotation separator. 
The process described above leads to the construction of a finite-state automaton
that contains all required information. However, the automaton is still
big, and an effort can be made to further reduce its size. Looking at its contents, 
we can see many states with a majority of transitions leading to one state (we 
will use the term default state ), but with other transitions as well. The target 
state must have one outgoing transition, and it must be labeled with the anno- 
tation separator. We can treat less frequent transitions as exceptions, assuming 
that all other transitions, even those that had not appeared in our lexicon, lead 
to the default state. Acting under this assumption, we can replace the frequent 
transitions and the default state with the transition leading from the default 
state. A limit can be imposed on the ratio of frequent/less frequent transitions 
to trigger the pruning. 

There is a difference between rules Rl, R2, . . . R5, and the rule we are about 
to introduce (R6). Rule R6 introduces a generalization. While still 100% of words 
present in the lexicon are annotated correctly, R6 may select one annotation 
among many as the correct one, and hide other possibilities. This may speed up 
an annotation process, but it can also introduce errors: some correct (but less 
probable) possibilities may not be shown, as the lexicon may not contain data 
that associates their endings with their correct annotations. 

Rule R 6. If for a given state the number of outgoing transitions leading to one
state (the default state) is greater than or equal to the number of all other outgoing
transitions multiplied by a small integer, and the default state has only one
outgoing transition and it is labeled with the annotation separator, then the default
state can be removed, and all transitions that lead to it should be replaced by
the transition leaving the default state.

Sometimes, it is impossible to devise a rule that associates an ending with 
the correct annotation, because the choice is lexicalized, i.e. it depends on a 
particular word, and it seems arbitrary from the morphological point of view. 
For example, in Polish, there is a rule that transforms adjectival endings -sny in
base forms into -śniejszy in comparatives and superlatives. There is, however,
another rule that transforms endings -śny into -śniejszy in comparatives and
superlatives. So there is no way of knowing what the base form might be
from a comparative or superlative ending other than a dictionary lookup. R6
introduces artificial divisions, e.g.:

-rasniejszy → -rasny
-masniejszy → -masny
-wasniejszy → -wasny
-osniejszy → -osny
-blesniejszy → -blesny
-usniejszy → -usny

while the right answer is that both annotations must be considered: 



jasniejszy → jasny
-iasniejszy → -iasny
-dosniejszy → -dosny
-zesniejszy → -zesny
-olesniejszy → -olesny
-śniejszy → -sny
-śniejszy → -śny

To cope with that situation, we introduce a new rule that strives to accom- 
modate such cases. We will use the term first annotated state to name a state 
that is a target of a transition labeled with the annotation separator (a state 
that begins an annotation or a set of annotations). 

Rule R7. If for a given state the number of first annotated states that are
reachable from the given state does not exceed a given limit, then: 

— replace the first annotated states by their union; 

— replace all the states and transitions between the chosen state and the union 
of the first annotated states by a single transition labeled with the annotation 
separator. 

Note that it is possible to introduce a lower limit on the number of states to
be removed in order to ensure that we are dealing with a case such as the one
described above (-sny and -śny). The rule can then work in parallel with R6.

It is worth noting that while the rule R6 introduces very detailed distinctions, 
R7 discards details. For the guesser, the result of applying R7 is that one gets 
more choices than without having applied R6 or R7. As to the lexicon size, R7 
removes small differences between similar word forms, making it possible to infer 
more general and compact relations between endings and annotations. 

Please note that although no annotation possibility is lost, and the automaton 
is much smaller, the answers for known words are no longer 100% accurate. The 
correct answer appears always, but it may be accompanied by other, incorrect 
possibilities. In many cases exceptions are merged with regular rules. A lower 
limit imposed on the number of states to be removed by this rule can solve the 
problem. 

3.3 Recognition 

It is mostly endings that decide what annotations a word may have. To get an 
annotation, we invert the unknown word, put the special symbol (word beginning 
marker) at the end, and look for such string in the automaton. Sooner or later, we 
come to a state that has no transition labeled with the subsequent letter of the 
string. That state may have a transition labeled with the annotation separator. 
If it has one, then the right language of that state is the set of annotations for 
the word. If not, we recursively look at the descendants of the state, searching 
for states with transitions labeled with the annotation separator, and adding 
the right languages of the states they lead to to the resulting annotation of the 
unknown word. 
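
The lookup described above can be sketched in a few lines. The sketch makes several assumptions that the text does not fix: the automaton is deterministic and acyclic, it is stored as a dictionary of dictionaries, `SEP` and `BEGIN` are hypothetical symbols for the annotation separator and the word-beginning marker, and the annotations returned are the strings that follow the separator.

```python
SEP = "+"    # hypothetical annotation separator
BEGIN = "^"  # hypothetical word-beginning marker


def right_language(automaton, finals, state):
    """All strings spelled on paths from `state` to a final state."""
    results = set()
    stack = [(state, "")]
    while stack:
        s, prefix = stack.pop()
        if s in finals:
            results.add(prefix)
        for label, target in automaton.get(s, {}).items():
            stack.append((target, prefix + label))
    return results


def collect(automaton, finals, state):
    """Annotations at `state`, or else those of its descendants."""
    out = automaton.get(state, {})
    if SEP in out:
        return right_language(automaton, finals, out[SEP])
    found = set()
    for target in out.values():
        found |= collect(automaton, finals, target)
    return found


def guess(automaton, finals, start, word):
    """Follow the inverted word as far as possible, then collect annotations."""
    state = start
    for ch in word[::-1] + BEGIN:
        nxt = automaton.get(state, {}).get(ch)
        if nxt is None:
            break
        state = nxt
    return collect(automaton, finals, state)


# Toy automaton: words ending in "ed" are verbs (V), words ending in "s" are nouns (N).
automaton = {0: {"d": 1, "s": 2}, 1: {"e": 3}, 3: {SEP: 4}, 4: {"V": 5},
             2: {SEP: 6}, 6: {"N": 5}, 5: {}}
finals = {5}
print(guess(automaton, finals, 0, "walked"))
```

For "walked" the walk stops after reading "d" and "e", where a separator transition is available. For a word such as "bad" the walk stops one state earlier, and the annotations are gathered from the descendants of that state.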

When we need to code prefixes or infixes (e.g. for German), the annotations 
contain information on what the prefix should be, and on what should be done 
at the beginning of the word to obtain the base form. In that case, the prefix 
stored in the automaton must be compared with the prefix of the analysed word. 
Table 1. Recall and precision for predictions of properties of unknown words

             categories only               categories and base forms
             R1-R5    R1-R6    R1-R5,R7    R1-R5    R1-R6    R1-R5,R7
recall       94.57    93.92    97.80       93.43    92.74    95.93
precision    90.03    91.64    64.81       88.45    90.34    61.42



4 Results 

Experiments were carried out on French morphology from ISSCO, Geneva, 
Switzerland. The data for the morphological lexicon was divided in 10 parts, 
9/10 were used to construct guessing automata, and 1/10 as a source of words 
whose annotations were to be guessed. This was done ten times, i.e. for each pair 
(9/10, 1/10). Standard measures of recall, precision, and coverage were used to 
evaluate the results. Recall and precision were calculated for each item, and the 
average of all items was calculated for each part. 
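
The per-item measures can be sketched as below. The paper does not spell out its formulas, so this is an assumed reading: recall and precision compare the set of guessed annotations with the set of correct ones, they are averaged over the words for which the guesser returned anything, and coverage is the share of such words. The function names are illustrative.

```python
def recall_precision(predicted, correct):
    """Recall and precision of one item, both given as sets of annotations."""
    hits = len(predicted & correct)
    recall = hits / len(correct) if correct else 0.0
    precision = hits / len(predicted) if predicted else 0.0
    return recall, precision


def evaluate(items):
    """items: list of (predicted set, correct set) pairs for one test part."""
    covered = [(p, c) for p, c in items if p]
    scores = [recall_precision(p, c) for p, c in covered]
    n = len(scores)
    recall = sum(r for r, _ in scores) / n if n else 0.0
    precision = sum(p for _, p in scores) / n if n else 0.0
    coverage = len(covered) / len(items) if items else 0.0
    return recall, precision, coverage


items = [({"N", "V"}, {"N"}), ({"A"}, {"A"}), (set(), {"N"})]
print(evaluate(items))
```

In the toy data the first word receives one superfluous guess, which lowers its precision to 0.5, and the third word receives no guess at all, which lowers coverage to 2/3.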

Table 1 shows recall and precision for prediction of morphological categories,
and both morphological categories and base forms. The coverage is 100% in all
cases. For the rule R6, we chose that there should be twice as many transitions
leading to the default state as to other states. For the rule R7, we merged
only two states at a time, they had to be the only children of a given state, and 
we did not count the transitions that led to them. By setting a threshold on the 
minimum number of transitions leading to the states that are to be merged by 
the rule R7, we can raise precision, but lower the coverage (e.g. to 95.99% and 
75.39% for categories only, and to 94.48% and 74.10% for both categories and 
base forms). 

The results show that if we want to minimize the number of choices and raise 
precision, we should use rules R1-R6. This is the case of simple POS-tagging. In 
cases where we want to make sure that we do not miss any possibility, we should 
use rules R1-R5 and R7. 

Mikheev (0) claims achieving 95.24% recall, 85.16% precision, and 92.66% 
coverage on categories only (he did not consider base forms). Note that we 
have 100% coverage, so his recall and precision should probably be multiplied 
by 0.9266 in order to be compared directly to ours. However, he performed 
experiments on English words, and it is not clear what the impact of the chosen 
language is on the results. He also used smoothing on a corpus. Since we have 
access neither to his programs, nor to the data he used, the only way we could 
compare our results to his would be to emulate his approach on our data. Our 
experiments required ca. 14 minutes to build all 10 guessing automata for one 
set of rules, and 2 minutes 15 seconds to evaluate the results on Pentium II 
350 MHz with 128 MB memory running under Linux. Using Mikheev’s method 
would probably take days. 
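
As a worked illustration of the scaling suggested above (our arithmetic, applied to the figures quoted from Mikheev), multiplying his recall and precision by his coverage gives:

```latex
\begin{align*}
  \text{recall:}    \quad & 95.24\% \times 0.9266 \approx 88.25\% \\
  \text{precision:} \quad & 85.16\% \times 0.9266 \approx 78.91\%
\end{align*}
```

These scaled values are only a rough basis for comparison, since the languages and data differ.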

Table 2 shows the results for guessing morphological descriptions of words.
The coverage is no longer 100%; this is probably caused by the existence of
arch-phonemes. The format of descriptions was the one used by the mmorph tool (0)
from ISSCO, Geneva.

Table 2. Recall, precision, and coverage for predictions of morphological descriptions
of unknown words

             R1-R5    R1-R6    R1-R5,R7
recall       93.22    92.47    94.67
precision    86.26    88.27    66.31
coverage     99.94    99.96    99.99

Table 3. Size of the automaton as function of rules

              categories only               morphological descriptions
              R1-R5    R1-R6    R1-R5,R7    R1-R5    R1-R6    R1-R5,R7
states        19800    18507    10123       39832    38301    27608
transitions   65635    48419    32729       99535    78976    61077

The choice of rules influences the size of the automaton. Table 3 shows that
relation for guessing only categories, and for guessing morphological descriptions. 
We can see that R6 cuts mostly transitions, but R7 cuts both transitions and 
states. 

5 Conclusions 

We have presented a novel finite-state approach to morphological analysis on 
words that are not present in the lexicon. Its main advantage is speed, both in
the rule acquisition phase, and in the rule application phase. Our method can also 
be used for lemmatization, and for acquisition of new words for a morphological 
lexicon. 

All programs and data used in our experiments are publicly available.
The finite-state guesser is available from http://www.pg.gda.pl/~jandac/fsa.html
as part of the fsa package, while mmorph and the MULTEXT French
morphology (written by Pierrette Bouillon) are available from ISSCO, Geneva,
Switzerland, at http://www.issco.unige.ch/.
Abstract. Finite automata are being used to encode images. Applications
of this technique include image compression, and extraction of self-similarity
information and Hausdorff dimension of the encoded image.
Jürgensen and Staiger [7] proposed a method by which the local Hausdorff
dimension of the encoded image could be effectively computed. This
paper describes the first implementation of this procedure and presents
some experimental results showing local entropy maps computed from
images represented by finite automata.



1 Introduction 

Local entropy (Hausdorff dimension) measures of images are of interest because 
the local entropy of an image is closely related to local texture. If we can map 
texture, it is possible to map the boundaries of objects in the image, or to 
simply map relative texture differences over the image. Image texture mapping 
techniques have many applications in remote sensing, an example of which can 
be found in 0. A general survey of texture analysis techniques for various ap- 
plications is given in m and 0. 

A method of computing the global Hausdorff dimension of a two color (binary)
image from the finite automaton representation was proposed in [11]. This
measure is not appropriate for inferring local texture information, since such a
measure corresponds only to the most disorderly point in the image. This idea
was refined, and a measure of local entropy, as well as an effective method of
computing local entropy, was defined in [7]. This paper introduces the first
implementation of the aforementioned method and the first image entropy maps
computed using this method.

In the sequel, we first briefly explain the relationship between languages, au- 
tomata, and images. We then proceed to review some of the theory behind the 
localization of Hausdorff dimension, the proposed procedure for computation of 
the local entropies from the automaton, and then discuss the actual implemen- 
tation and software, and show some example entropy maps. 
2 Languages, Automata, and Images

Consider a binary image given by a finite quadtree. Suppose each edge of the
tree is labeled with a number from the set $\Sigma = \{0, 1, 2, 3\}$ representing the
corresponding quadrant. The labels along a path from the root of the tree to a leaf
node form a word over the alphabet $\Sigma$. Each such word can be thought of as the
quadtree address of a pixel that is turned “on”. All such words together form a
finite regular language and we can thus construct the corresponding automaton.
If a word is accepted by the automaton, then the pixel at the corresponding
address is turned “on”.

Suppose an image is given as an infinite quadtree. The quadtree addresses of
image points that are turned “on” form an $\omega$-language $M \subseteq \Sigma^\omega$. If the language
$M$ is regular we can then construct a Büchi automaton $\mathcal{A} = (Q, q_0, \Delta, F)$
which accepts $M$. If we then treat $\mathcal{A}$ as a classical automaton with $F = Q$ and
run all finite words of length $n$ on $\mathcal{A}$ we can render the original image at a finite
resolution of $2^n \times 2^n$ pixels.
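
A minimal sketch of this rendering step follows. It assumes a deterministic automaton stored as a dictionary of dictionaries with every state final, and it assumes one particular quadrant convention (the high bit of a digit selects the lower half, the low bit the right half); the paper does not fix the convention, and the function names are illustrative.

```python
def address_to_pixel(word):
    """Map a quadtree address over {0,1,2,3} to (row, col).

    Assumed convention: bit 1 of each digit selects the lower half,
    bit 0 selects the right half of the current square.
    """
    row = col = 0
    for digit in word:
        row = 2 * row + (digit >> 1)
        col = 2 * col + (digit & 1)
    return row, col


def render(delta, start, n):
    """Render a 2^n x 2^n binary image; every state counts as final."""
    size = 2 ** n
    image = [[0] * size for _ in range(size)]

    def walk(state, word):
        if len(word) == n:
            r, c = address_to_pixel(word)
            image[r][c] = 1
            return
        # Only existing transitions are followed, so rejected words are skipped.
        for digit, target in delta.get(state, {}).items():
            walk(target, word + [digit])

    walk(start, [])
    return image


# One state with loops on 0 and 3: under the assumed convention, a diagonal line.
delta = {0: {0: 0, 3: 0}}
for row in render(delta, 0, 2):
    print(row)
```

Walking only along existing transitions avoids enumerating all $4^n$ words when large parts of the image are empty.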

A detailed and more generalized description of how to encode greyscale im- 
ages as finite automata is given in |2| and p. 

3 Local Hausdorff Dimension 

In this section we review how the local Hausdorff dimension measure is obtained 
from the global Hausdorff dimension, as described in p. 

Let $M$ be a subset of $\Sigma^\infty = \Sigma^* \cup \Sigma^\omega$. Let $\mathrm{pref}_n M$ be the set of prefixes of
length $n$ of $M$. For a finite word $w$, the language

$w^{[-1]}M = \{\xi \mid \xi \in \Sigma^\omega,\ w\xi \in M\}$    (1)

is called a state of $M$. $M$ is finite-state if it has finitely many states. The entropy
of $M$ is denoted by $H_M$ (which can be computed from the structure function) and
the Hausdorff dimension of $M$ is denoted $\dim M$. It is well known that $\dim M \le H_M$
for every $M \subseteq \Sigma^\omega$. In fact, for finite-state and closed $\omega$-languages, it is true
that $\dim M = H_M$. The following theorem shows how to calculate the global
entropy, and hence the global Hausdorff dimension, of $M$.

Theorem 1. If $M$ is a finite-state closed $\omega$-language, then $H_M = \log_{|\Sigma|} \alpha$,
where $\alpha$ is the maximum eigenvalue of the adjacency matrix $U_M$ of $M$.
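
As a small worked instance of Theorem 1 (our own example, not taken from the paper), let $\Sigma = \{0,1,2,3\}$ and $M = \{0,1,2\}^\omega$, the addresses that never enter quadrant 3. $M$ is closed and has a single non-empty state with three loops, so:

```latex
\[
  U_M = \begin{pmatrix} 3 \end{pmatrix}, \qquad
  \alpha = 3, \qquad
  H_M = \log_{4} 3 = \frac{\ln 3}{\ln 4} \approx 0.7925 .
\]
```

The full image $\Sigma^\omega$ gives $\alpha = 4$ and $H_M = 1$, while a single point gives $\alpha = 1$ and $H_M = 0$.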

Since $\dim M$ is constructed from a metric space, its localization begins with
defining local size measures. Let $S = (X, \rho)$ be any compact metric space, and
let $\lambda$ be a real valued function on $S$ which satisfies:

$\lambda(M) \ge 0$ for $M \ne \emptyset$    (2)

$\lambda(X) < \infty$    (3)

$\lambda$ is monotone    (4)

A function $\lambda$ with these properties is called a size measure. $\lambda$ could be any
of entropy ($H$), subword complexity ($\tau$), program complexity ($\kappa$) or Hausdorff
dimension ($\dim$). These measures are localized by applying the localization
operator $l$. For $M \subseteq X$, $x \in X$, and $\varepsilon > 0$, let

$l_M^{(\lambda)}(x, \varepsilon) = \lambda(K_\varepsilon(x) \cap M)$    (5)

where $K_\varepsilon(x)$ is the open ball of radius $\varepsilon$ around the point $x$, that is, all points $x'$ such
that $\rho(x, x') < \varepsilon$.

Note that if $M \subseteq M'$ and $K_\varepsilon(x) \subseteq K_{\varepsilon'}(x')$ then $l_M^{(\lambda)}(x, \varepsilon) \le l_{M'}^{(\lambda)}(x', \varepsilon')$. This
property ensures that the following definition makes sense:

Definition 1. For $M \subseteq X$ and $x \in X$,

$l_M^{(\lambda)}(x) = \lim_{\varepsilon \to 0} l_M^{(\lambda)}(x, \varepsilon)$    (6)

is the local $(M, \lambda)$-size-measure at $x$.

It turns out that, as in the non-local case, several important local measures 
coincide if M is finite-state and closed. 

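Theorem 2. j2| Let $M \subseteq \Sigma^\omega$ be finite-state and closed. Then

$l_M^{(H)}(\xi) = l_M^{(\tau)}(\xi) = l_M^{(\kappa)}(\xi) = l_M^{(\dim)}(\xi)$    (7)

for all $\xi \in \Sigma^\omega$.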

The fact that local entropy and local Hausdorff dimension coincide for such
languages makes it possible to compute the Hausdorff dimension of the states of 
M by computing the entropy of the states of M. 

Theorem 3. [Z| Let $M$ be a finite-state closed $\omega$-language and let $\xi$ be an
$\omega$-word. If $(\mathrm{pref}_i \xi)^{[-1]} M$ is empty for some $i \in \mathbb{N}$ then $l_M^{(H)}(\xi) = 0$. Otherwise,
there is a state $M_\xi$ occurring infinitely often in the sequence $((\mathrm{pref}_i \xi)^{[-1]} M)_{i \in \mathbb{N}}$
and $l_M^{(H)}(\xi) = H_{M_\xi}$.

Computing the local entropy at $\xi$, and accordingly at the point of the image
addressed by quadtree address $\xi$ if $M \subseteq \{0,1,2,3\}^\omega$, can now be performed in
two steps:

1. Find a state $M_\xi$ which occurs infinitely often among the states
$((\mathrm{pref}_i \xi)^{[-1]} M)_{i \in \mathbb{N}}$.

2. Compute $H_{M_\xi}$.

If $\xi$ is of the form $u v^\omega$ for some $u, v \in \Sigma^*$ then step 1 can be performed in
finite time. The next section discusses the proposed method for performing step 2.
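
Step 1 for an ultimately periodic address $u v^\omega$ can be sketched as follows. The sketch assumes a deterministic automaton stored as a dictionary of dictionaries whose states stand for the states of $M$, and a non-empty period `v`; the function name is illustrative.

```python
def recurrent_state(delta, start, u, v):
    """Return a state visited infinitely often on u v^omega.

    Returns None if the run dies, which corresponds to an empty state
    and hence to local entropy 0. Assumes v is non-empty.
    """
    state = start
    for x in u:
        state = delta.get(state, {}).get(x)
        if state is None:
            return None
    seen = set()
    # Compare states at the boundaries between copies of v. Since the
    # automaton is finite, some boundary state must eventually repeat.
    while state not in seen:
        seen.add(state)
        for x in v:
            state = delta.get(state, {}).get(x)
            if state is None:
                return None
    return state


delta = {0: {0: 1}, 1: {1: 2}, 2: {1: 1}}
print(recurrent_state(delta, 0, [0], [1]))
```

The loop runs at most once per state before a boundary state repeats, so the search takes at most $|u| + |Q| \cdot |v|$ transitions.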
4 State Entropy Computation Procedure

Given the results of the previous section, the authors of [7] suggest the following
method for computing $H_{M_\xi}$.

Consider the directed multigraph $G_M$. This graph corresponds to the automaton
$\mathcal{A}_M = (Q, \Sigma, \delta, s, F)$ of the language $M$ as follows. The edges are labeled
with elements of $\Sigma$ and the set of vertices $S_M$ consists of the non-empty states of $M$.
There is an edge $(Q, x, Q')$ from $Q$ to $Q'$ with label $x$ if and only if $Q' = x^{[-1]}Q$,
that is, if there is a transition from state $Q$ to state $Q'$ on input $x$. The set of
edges we call $E_M$. We write $Q \vdash Q'$ if there is an edge from $Q$ to $Q'$, and we let
$\vdash^*$ be the reflexive and transitive closure of $\vdash$. A non-empty subset $K$ of $S_M$ is
a strongly connected component of $G_M$ if for any two states $Q, Q' \in K$, one has
$Q \vdash^* Q' \vdash^* Q$. We extend $\vdash$ to components $K$, $K'$ by writing $K \vdash K'$ if and only
if $Q \vdash Q'$ for some $Q \in K$ and $Q' \in K'$.

The states are then numbered according to the following rules: 

1. Ms, the start state, is state number 1. 

2. The states in a strongly connected component are numbered consecutively. 

3. If $Q$ and $Q'$ are in different strongly connected components and $Q \vdash^* Q'$
then the number of $Q$ is strictly less than that of $Q'$.

This numbering induces a numbering of the strongly connected components.
Under the new numbering of states the adjacency matrix $U_M$ of $M$ will be in
upper block diagonal form:



$U_M = \begin{pmatrix} A_1 & \ast & \ast \\ 0 & \ddots & \ast \\ 0 & 0 & A_k \end{pmatrix}$    (8)

where $k$ is the number of strongly connected components and $A_i$ corresponds
to the transitions within component $i$. $H_{M_\xi}$ is obtainable using the following
theorem.



Theorem 4. If $M$ is finite-state and closed and $K$ is a strongly connected
component of $G_M$ and $Q \in K$ then

$H_Q = \log_{|\Sigma|} \max\{\alpha_{K'} \mid K \vdash^* K'\}$    (9)

where $\alpha_{K'}$ is the maximal eigenvalue of $A_{K'}$.

The computation of the entropy of the states of $M$ is reduced to the
computation of the eigenvalues of each $A_i$ of $U_M$.



5 State Entropy Computation Implementation 

For the first time, the proposals reviewed in the previous section have been 
implemented, and we describe this implementation here. A suite of software has 
been written by the author in C. There is a program for precisely encoding a
binary image as an automaton, a program to compute the state entropies of
that automaton, and a third program to render both the automaton-generated
image and the entropy map image at any resolution that is a power of two.

The program that encodes images as an automaton is very similar to the
algorithm given in [3] for greyscale images, except it does not consider the simple
transformations on the subimages.

The state entropy computation program is based on the theory presented 
in the previous sections and is described in more detail here. The rendering 
program will be discussed in the next section. 

In this section, we restrict $M$ to be a language of image quadtree addresses
over the alphabet $\Sigma = \{0, 1, 2, 3\}$. From a finite quadtree, we infer the graph $G_M$.
$G_M$ corresponds precisely to the classical finite automaton $\mathcal{A}_M = (Q, \Sigma, \delta, s, F)$
that encodes the image. Inference algorithms have already been published for 
grayscale images 0 and are easily adapted to black and white images. 

The state entropy calculation program reads in the inferred automaton and
constructs $G_M$. The strongly connected components (or simply components) are
computed and each state is tagged with its resulting component number.

A new graph $T_M = (S_T, E_T)$ is then constructed by representing all of the
states in a component by a single vertex. $S_T$ thus contains an element for each
component $K$ in $G_M$ and $E_T$ contains an edge $(K, K')$ if and only if $K \vdash K'$ in
$G_M$.

It is easy to show that $T_M$ is a tree. Since $T_M$ is a tree, we can perform a
topological sort on the tree. As we order each vertex of $T_M$, we assign consecutive
topological numbers to the states in the corresponding component of $G_M$. At
the conclusion of this numbering, the topological numbers are consecutive both
within each component and over all of the states in $G_M$. During this step we
also note the smallest topological number $t_i$ in each component $i$, and construct
a vector $L$ such that $t_i$ is stored in $L_i$.
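
The component computation and the topological numbering can be sketched together. This is not the author's C code: it uses Tarjan's algorithm, which emits components in reverse topological order, so reversing its output gives the required order directly. The graph is assumed to be a dictionary from each state to its successors, with every state reachable from `start`; the function names are illustrative.

```python
def strongly_connected_components(graph):
    """Tarjan's algorithm; components are returned in reverse topological order."""
    index, low, on_stack, stack, comps = {}, {}, set(), [], []

    def visit(v):
        index[v] = low[v] = len(index)
        stack.append(v)
        on_stack.add(v)
        for w in graph.get(v, ()):
            if w not in index:
                visit(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:          # v is the root of a component
            comp = []
            while True:
                w = stack.pop()
                on_stack.discard(w)
                comp.append(w)
                if w == v:
                    break
            comps.append(comp)

    for v in graph:
        if v not in index:
            visit(v)
    return comps


def topological_numbering(graph, start):
    """Return (number, first): 1-based state numbers and the vector L."""
    comps = strongly_connected_components(graph)[::-1]
    number, first = {}, []
    for comp in comps:
        comp.sort(key=lambda q: q != start)   # the start state gets number 1
        first.append(len(number) + 1)         # smallest number in this component
        for q in comp:
            number[q] = len(number) + 1
    return number, first


graph = {"s": ["a", "b"], "a": ["a", "c"], "c": ["a"], "b": ["b"]}
number, first = topological_numbering(graph, "s")
print(number, first)
```

The recursion is adequate for small automata; a large automaton would need an iterative variant or a raised recursion limit.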

From the automaton, it is easy to build the adjacency matrix $U_M$: one simply
counts the number of transitions between each ordered pair of states. Instead of
the original state numbers, however, we use the topological numbers of the states
to index the adjacency matrix. This results in the upper block diagonal form
matrix shown in equation (8).

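Using the component size vector $L$ we can now easily extract the submatrices
$A_i$ of $U_M$. Specifically,

$A_i = \begin{pmatrix} (U_M)_{a,a} & \cdots & (U_M)_{a,a+b-1} \\ \vdots & \ddots & \vdots \\ (U_M)_{a+b-1,a} & \cdots & (U_M)_{a+b-1,a+b-1} \end{pmatrix}$    (10)

where $a = L_i$, and $b = L_{i+1} - L_i$.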
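
Building the adjacency matrix and cutting out the diagonal blocks can be sketched as below. The data layout (a transition dictionary, a dictionary of 1-based topological numbers, and a list playing the role of $L$) is an assumption for illustration; the size of the last block is obtained by treating the number of states plus one as a sentinel entry of $L$.

```python
def adjacency_matrix(delta, number):
    """Count transitions between states, indexed by topological number."""
    n = len(number)
    U = [[0] * n for _ in range(n)]
    for q, out in delta.items():
        for target in out.values():
            U[number[q] - 1][number[target] - 1] += 1
    return U


def diagonal_blocks(U, first):
    """Extract the submatrices A_i; `first` holds the smallest number per component."""
    bounds = [f - 1 for f in first] + [len(U)]
    return [[row[a:e] for row in U[a:e]] for a, e in zip(bounds, bounds[1:])]


delta = {"s": {0: "a", 1: "b"}, "a": {0: "a", 1: "c"},
         "c": {0: "a"}, "b": {0: "b", 1: "b"}}
number = {"s": 1, "b": 2, "c": 3, "a": 4}
first = [1, 2, 3]
U = adjacency_matrix(delta, number)
for block in diagonal_blocks(U, first):
    print(block)
```

State `b` has two loops, so its block is the 1 x 1 matrix (2), which is why a multigraph with counted transitions is needed rather than a plain graph.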

The next step is to compute for each $A_i$ the maximal eigenvalue $\alpha_i$, if it exists.
Since these matrices are non-negative and irreducible, the Perron-Frobenius the- 
orem (see m) applies, which states that each such matrix has a unique positive 
and real eigenvalue. Our interest is in the maximal eigenvalue $\alpha_i$ of $A_i$, so the
power method of computing eigenvalues, as described in |^, lends itself nicely 
to the problem, since, when the power method converges, it always converges to 
the largest eigenvalue. The power method will converge for our matrices because 
they are non-negative. 
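
A power iteration for the maximal eigenvalue can be sketched as follows. One detail is our own choice and not part of the paper: the iteration is run on $A + I$ rather than on $A$, because for an irreducible but periodic block (for example a pure cycle) the plain iteration can oscillate, whereas $A + I$ has the same eigenvectors with every eigenvalue shifted by one and a strictly dominant largest eigenvalue.

```python
def max_eigenvalue(A, iterations=1000, tol=1e-12):
    """Largest eigenvalue of a non-negative irreducible square matrix A.

    Power iteration on B = A + I (a shift chosen here for robustness);
    the result is the dominant eigenvalue of B minus 1.
    """
    n = len(A)
    B = [[A[i][j] + (1 if i == j else 0) for j in range(n)] for i in range(n)]
    x = [1.0] * n
    value = 0.0
    for _ in range(iterations):
        y = [sum(B[i][j] * x[j] for j in range(n)) for i in range(n)]
        new_value = max(y)               # positive, since B has a positive diagonal
        x = [yi / new_value for yi in y]  # rescale so the largest entry is 1
        if abs(new_value - value) < tol:
            break
        value = new_value
    return new_value - 1.0


print(max_eigenvalue([[0, 1], [1, 1]]))
```

For the block `[[0, 1], [1, 1]]` the result is the golden ratio, and for the two-state cycle `[[0, 1], [1, 0]]` it is 1, a case in which the unshifted iteration started from an unbalanced vector would not settle.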

Having computed each $\alpha_i$, it is now possible to compute the state entropies
of $M$ according to Theorem 4. Essentially this theorem requires that, to find the
entropy $H_q$ of state $q \in Q$, we find the component $K'$ with the largest submatrix
eigenvalue that is reachable from state $q$, that is, $\alpha_{K'}$, and take the logarithm
of that eigenvalue. This can be simplified further, since every state in a
component $K$ is reachable from every other state in $K$. Therefore, if we find $H_p$ for
any one state $p$ in $K$, then we have found $H_q$ for all $q \in K$. This reduces the
step to finding, for each component $K$, the component $K'$ with the largest matrix
eigenvalue that is reachable from $K$ (which could be $K$ itself), then assigning
the logarithm of $\alpha_{K'}$ as the entropy of each state $q \in K$. $K'$ can be found easily
from $T_M$ using a depth-first search algorithm. The resulting state entropies for
all $q \in Q$ are output to a data file for later consumption by the entropy map renderer.
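
The propagation of the largest reachable eigenvalue can be sketched as below. Instead of the depth-first search mentioned in the text, this sketch sweeps the components in reverse topological order, which gives the same result when components are indexed in topological order. The argument names are illustrative, and assigning entropy 0 when no positive eigenvalue is reachable is our assumption.

```python
import math


def state_entropies(comp_of, comp_edges, alphas, alphabet_size=4):
    """Entropy of every state according to Theorem 4.

    comp_of:    state -> component index (components in topological order).
    comp_edges: component index -> iterable of successor component indices.
    alphas:     maximal eigenvalue of each component's block.
    """
    best = list(alphas)
    # Successors have larger indices, so a backward sweep sees them first.
    for k in range(len(alphas) - 1, -1, -1):
        for succ in comp_edges.get(k, ()):
            best[k] = max(best[k], best[succ])
    return {q: (math.log(best[k], alphabet_size) if best[k] > 0 else 0.0)
            for q, k in comp_of.items()}


comp_of = {"s": 0, "b": 1, "c": 2, "a": 2}
comp_edges = {0: [1, 2]}
alphas = [0.0, 2.0, (1 + 5 ** 0.5) / 2]
print(state_entropies(comp_of, comp_edges, alphas))
```

In the example the start state `s` inherits the eigenvalue 2 of the component of `b`, so both receive entropy $\log_4 2 = 0.5$.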

6 Entropy Maps 

As observed in Sect. 2, if we allow all of the states in $G_M$ to be final states,
then we can render the image at any $2^n$ by $2^n$ resolution by determining the
acceptance or non-acceptance of all finite words in $\Sigma^n$. The rendering program
thus reads in the automaton $G_M$, and the state entropies are read into a vector
$H$. If a word $w = w_1 w_2 \cdots w_n$ is accepted, then the pixel at the quadtree address
$w$ is turned “on”. The entropy map is constructed by doing exactly the same
thing except instead of turning “on” the pixel of accepted quadtree address $w$,
we mark the corresponding pixel with the value of $H_q$ where $q$ is the state that
accepted $w$.

If $w$ is rejected at the input $w_i$ (due to an absent transition), then the address
w is assigned an entropy value of 




where $q$ is the current state when $w_i$ is read. The same is true if $w_i$ results
in a transition to the state which represents the image with all pixels turned
“on”. This is done because in both cases, the automaton is about to make a
transition to a state that represents either the all black image, or the all white
image. We call these “flat” states. Both such states have an entropy of zero.
This does not make sense when you consider, for example, a black and white
pixel checkerboard pattern which intuitively should not have an entropy of zero.
This problem arises because we are using an automaton generated from a finite
quadtree. The inverse exponential coefficient compensates for this. The larger $i$
is, the more closely the coefficient approaches unity. Thus if we reach a flat state
at the pixel level, where $i = n$, then the pixel in question will receive the full
non-zero entropy value of the previous state. If we reach a flat state early in the
reading of $w$, where $i \ll n$ (which corresponds to a large flat area in the image),
then we get an entropy value that is close to zero because the coefficient is very
small. This is exactly what we would expect intuitively.

The entropies in $H$ are, in general, not integers; hence we cannot directly
generate an image from them. We can, however, before assigning entropies to
pixels, map the range of entropy values in $H$ into the range [0,255]. This will
produce a greyscale image where greylevel is an indication of relative entropy.
We apply the following transformation $g : H \to [0,255]$ to each $h \in H$:

$g(h) = 255 \times \frac{h - h_{\min}}{h_{\max} - h_{\min}}$    (12)

where $h_{\min}$ and $h_{\max}$ are the smallest and largest elements in $H$ respectively.
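
The grey-level mapping is a plain min-max normalization and can be sketched as below. Rounding to the nearest integer and returning all zeros when every entropy is equal are our assumptions; the text does not say how these cases are handled.

```python
def grey_levels(entropies):
    """Map entropy values linearly onto the integer grey levels 0..255."""
    h_min, h_max = min(entropies), max(entropies)
    if h_max == h_min:
        # Degenerate case: a constant map carries no contrast.
        return [0 for _ in entropies]
    return [round(255 * (h - h_min) / (h_max - h_min)) for h in entropies]


print(grey_levels([0.0, 0.25, 1.0]))
```

Because the mapping depends on the smallest and largest values of each image, the same grey level can stand for different entropies in different maps.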

Having assigned entropy values to each quadtree address and applied g to 
each such value we then create an image by converting the quadtree addresses 
to pixel coordinates, and assign the transformed entropy values to the pixels. 
The result is a greyscale image which shows how local entropy varies over the
original image. For further enhancement, one can use the 8-bit greylevels to 
index a color palette and the image can be converted into a false color entropy 
map. 

It should be noted that if one desires to compare entropy maps of two different 
images visually, then g is not an appropriate mapping because it normalizes the 
raw entropy values. The same greylevel in two different entropy maps will not 
necessarily represent the same actual entropy value. 

We conclude this section with a few example entropy maps. 



Example 1. Figure 1 shows a binary image of the author which was generated by
a finite automaton, and the corresponding local entropy map. White indicates
the areas of highest entropy, while black represents the areas of lowest entropy.
A notable feature of this entropy map is that flat black and flat white regions,
such as the image background and teeth area respectively, are both represented
by areas of zero entropy. Also of interest is that the lights in the background,
which are approximately flat white areas on a flat black background, show up
as having high entropy only along the edge of the light, while the flat areas on
either side of the edge show very low entropy. Finally one notes that the gradient
from the shirt collar to the shoulder also shows up as a slight gradient in entropy.



Example 2. In figure 2 we see an example of a hand-designed automaton to 
illustrate that the strong components really do result in regions of different 
entropy. The automaton that generates the picture on the left has four different 
components, each generating one quadrant of the image. Although difficult to 
see without color enhancement, each quadrant shows a different entropy value in 
the right hand image. Additional entropy levels are introduced by the exponential
coefficient discussed above. 
Fig. 1. Left: a picture of the author as generated by a finite automaton. Right: the
local entropy map of the left image. 




Fig. 2. Left: A hand-drawn automaton with four strong components. Right: The local 
entropy map of the left image. 



Example 3. In figure 3, the source image is a binary image of Dr. Helmut
Jürgensen. The corresponding entropy map is shown beside the original image.
There are fewer areas of complete blackness or whiteness in this image, compared
to figure 1, so overall it is more disorderly, as one would expect.

Note 1. Dr. Jürgensen is my PhD thesis supervisor.

Color versions of these example entropy maps can be found at the web address
http://www.csd.uwo.ca/~meramicin/wia99
and are also printed in 
Fig. 3. Left: A binary image of Dr. Helmut Jürgensen. Right: The local entropy map
of the left image. 



7 Results and Conclusion 

We have shown the first examples of entropy maps of an image created from no 
information other than the automaton encoding of the image. We plan to apply 
this technique to a large variety of black and white images and to extend it to 
greyscale and color images. 
Abstract. A finite-state machine is called a Thompson machine if it can
be constructed from a regular expression using Thompson’s construction. 
We call the underlying digraph of a Thompson machine a Thompson di- 
graph. We establish and prove a characterization of Thompson digraphs. 
As one application of the characterization, we give an algorithm that 
generates an equivalent regular expression from a Thompson machine in 
time linear in the number of states. 



1 Introduction 

In 1968, Thompson |B| gave an inductive construction of finite-state machines 
from regular expressions that was motivated by grep . The resulting finite-state 
machines have sizes linear in the sizes of the original expressions. A resurgence of
interest in the implementation of machines has resulted in some new discoveries 
about the Thompson construction m 

We characterize the underlying digraphs of the machines resulting from the 
Thompson construction on empty-free regular expressions (they do not include 
the empty-set symbol): Thompson digraphs and Thompson machines, re- 
spectively. 

First, we characterize Thompson digraphs that are obtained from empty-free,
star-free regular expressions. These digraphs are acyclic; therefore, we call them 
Thompson dags. We use Dyck languages defined by source-sink paths in a 
Thompson digraph in the characterization. 

Second, since Thompson digraphs are hammocks^1, we can easily find all
back edges with a depth-first traversal in linear time. Therefore, in linear time,
we can determine where the star units occur and we can then transform the
digraph into a dag with what we call star reduction. The resulting dag is a
Thompson dag if and only if the original digraph is a Thompson digraph.

* This research was supported under a grant from the Research Grants Council of
Hong Kong SAR. It was carried out while the first and second authors were visiting
HKUST.

^1 In the literature hammocks are often called st-digraphs.

The characterization provides us with a means of generating small regular 
expressions from some finite-state machines. We first determine whether a given 
finite-state machine is Thompson and, if it is, we construct a small equivalent 
regular expression from the machine. 



2 Notation and Terminology 

We recall the basics of digraphs, finite-state machines and regular expressions 
and introduce the notation that we use. 

A directed graph or digraph $G = (V, E)$ consists of a finite set $V$ of vertices
and a set $E$ of directed edges of the form $(u, v)$, where $u$ and $v$ are vertices. A
path is a sequence $(v_0, v_1), (v_1, v_2), \ldots, (v_{k-1}, v_k)$ of edges; it is a cycle if $v_0 = v_k$
and $k \ge 1$. A path is a simple path if it contains no cycles. A digraph that has
no cycles is called a directed acyclic graph or dag. The size of a digraph is
the sum of the number of vertices and number of edges.

We are particularly interested in digraphs that have a single designated 
source vertex s that has no edges entering it, a single designated sink ver- 
tex S that has no edges exiting it, and each of its vertices occurs on some simple 
path from the source to the sink. Such digraphs are called hammocks and we 
denote them with a tuple {V,E,s,S). 

A finite-state machine^2 consists of a finite set $Q$ of states, an input
alphabet $\Sigma$, a start state $s \in Q$, a final state $f \in Q$ and a transition
relation $\delta \subseteq Q \times \Sigma_\lambda \times Q$, where $\lambda$ denotes the null string and $\Sigma_\lambda = \Sigma \cup \{\lambda\}$.
Clearly, we can depict the transition relation of such a machine as an
edge-labeled digraph (the labels are symbols from $\Sigma_\lambda$); it is usually called the state or
transition digraph of the machine. If we drop the edge labels of a state digraph
and ignore multiple edges, we obtain a digraph, the underlying digraph of the
machine. The size of a machine is the number of its transitions.

Let $\Sigma$ be an alphabet. Then, we define a regular expression $E$ over $\Sigma$
inductively as follows:

$E = \emptyset$, where $\emptyset$ is the empty-set symbol;

$E = \lambda$, where $\lambda$ is the null-string symbol;

$E = a$, where $a$ is in $\Sigma$;

$E = (F + G)$, where $F$ and $G$ are regular expressions;

$E = (F \cdot G)$, where $F$ and $G$ are regular expressions;

$E = (F^*)$, where $F$ is a regular expression.

We define the language of E inductively, in the usual way m- 

^2 Normally, a finite-state machine is allowed to have more than one final state and,
sometimes, more than one start state. The formulation we have chosen is appropriate
for the study of Thompson machines.

We say that a regular expression $E$ is empty free if $E$ does not contain any
appearance of the empty-set symbol. We say that a regular expression is star
free if it has no Kleene-star subexpression. The size $|E|$ of a regular expression $E$
is the total number of appearances in $E$ of symbols from $\Sigma \cup \{\lambda, \emptyset\}$.
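
These three notions translate directly into recursive functions over a syntax tree. The tuple encoding below (tags such as "sym" and "plus") is a hypothetical representation chosen for illustration; only the definitions themselves come from the text.

```python
# Hypothetical encoding: ("empty",), ("null",), ("sym", a),
# ("plus", F, G), ("dot", F, G), ("star", F).
LEAVES = ("empty", "null", "sym")


def size(e):
    """Number of appearances of alphabet, null-string and empty-set symbols."""
    if e[0] in LEAVES:
        return 1
    return sum(size(sub) for sub in e[1:])


def is_empty_free(e):
    """True if the empty-set symbol does not appear in e."""
    if e[0] in LEAVES:
        return e[0] != "empty"
    return all(is_empty_free(sub) for sub in e[1:])


def is_star_free(e):
    """True if e has no Kleene-star subexpression."""
    if e[0] == "star":
        return False
    if e[0] in LEAVES:
        return True
    return all(is_star_free(sub) for sub in e[1:])


# The expression (((a + b)*) . ((b + lambda) . a))
a, b, null = ("sym", "a"), ("sym", "b"), ("null",)
expr = ("dot", ("star", ("plus", a, b)), ("dot", ("plus", b, null), a))
print(size(expr), is_empty_free(expr), is_star_free(expr))
```

The sample expression contains the symbols a, b, b, λ and a, so its size is 5; it is empty free but not star free.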

3 Thompson Digraphs 

Thompson developed an inductive construction (see Fig. 1) to compile
regular expressions into finite-state machines. In Fig. 2 we give the result of the
Thompson construction on the regular expression $(((a + b)^*) \cdot ((b + \lambda) \cdot a))$.

Fig. 1. The Thompson construction. The order of the figures corresponds to the order of
the cases in the definition of regular expressions. The finite-state machines correspond
to the regular expressions: (a) $E = \emptyset$; (b) $E = \lambda$; (c) $E = a$, $a \in \Sigma$; (d) $E = (F + G)$;
(e) $E = (F \cdot G)$; and (f) $E = (F^*)$. When a given regular expression $E$ is empty free, (a) is
never used by the Thompson construction. We include (a) for completeness.

We define a Thompson machine to be a finite-state machine that is obtained
by the Thompson construction on an empty-free regular expression and
we define a Thompson digraph to be the underlying digraph of a Thompson
machine. We name the units that the Thompson construction uses to assemble
a Thompson machine as follows: the base units (Fig. 1(b) and (c)); the plus
unit (Fig. 1(d)); the dot unit (Fig. 1(e)); and the star unit (Fig. 1(f)).

Fig. 2. The result of the Thompson construction on the running example expression
$(((a + b)^*) \cdot ((b + \lambda) \cdot a))$.



Notationally, given an empty-free regular expression $E$, we denote the
resulting Thompson machine by $M_E$ and, similarly, we denote the corresponding
Thompson digraph by $H_E$.
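
The construction and the passage to the underlying digraph can be sketched as follows. This is a minimal sketch under stated assumptions: the figure that fixes the units is not reproduced in this text, so the wiring used here follows the common textbook presentation (the plus and star units add a new start and final state joined by $\lambda$-transitions, the dot unit links the two machines by one $\lambda$-transition) and should be checked against Fig. 1. The tuple encoding of expressions and the names `thompson`, `underlying_digraph` and `max_degrees` are illustrative.

```python
from itertools import count


def thompson(expr, fresh=None):
    """Return (start, final, edges); edges are (source, label, target) triples.

    expr is a nested tuple: ("sym", a), ("lambda",), ("plus", F, G),
    ("dot", F, G) or ("star", F). The unit wiring is an assumption of this sketch.
    """
    fresh = fresh if fresh is not None else count()
    kind = expr[0]
    if kind in ("sym", "lambda"):                  # base units: one transition
        s, f = next(fresh), next(fresh)
        label = expr[1] if kind == "sym" else "λ"
        return s, f, [(s, label, f)]
    if kind == "dot":                              # dot unit: link F to G
        s1, f1, e1 = thompson(expr[1], fresh)
        s2, f2, e2 = thompson(expr[2], fresh)
        return s1, f2, e1 + e2 + [(f1, "λ", s2)]
    if kind == "plus":                             # plus unit: fork and join
        s1, f1, e1 = thompson(expr[1], fresh)
        s2, f2, e2 = thompson(expr[2], fresh)
        s, f = next(fresh), next(fresh)
        links = [(s, "λ", s1), (s, "λ", s2), (f1, "λ", f), (f2, "λ", f)]
        return s, f, e1 + e2 + links
    if kind == "star":                             # star unit: loop and bypass
        s1, f1, e1 = thompson(expr[1], fresh)
        s, f = next(fresh), next(fresh)
        links = [(s, "λ", s1), (f1, "λ", f), (f1, "λ", s1), (s, "λ", f)]
        return s, f, e1 + links
    raise ValueError("empty-free expression expected, got %r" % (kind,))


def underlying_digraph(edges):
    """Drop the labels and ignore multiple edges."""
    return {(u, v) for (u, _, v) in edges}


def max_degrees(digraph):
    """Largest indegree and largest outdegree over all vertices."""
    indeg, outdeg = {}, {}
    for u, v in digraph:
        outdeg[u] = outdeg.get(u, 0) + 1
        indeg[v] = indeg.get(v, 0) + 1
    return max(indeg.values()), max(outdeg.values())


# (((a + b)*) . ((b + lambda) . a))
example = ("dot",
           ("star", ("plus", ("sym", "a"), ("sym", "b"))),
           ("dot", ("plus", ("sym", "b"), ("lambda",)), ("sym", "a")))
start, final, edges = thompson(example)
```

With the assumed wiring, the example yields 19 transitions, the underlying digraph has the same number of edges, and no vertex has indegree or outdegree above two.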

Observe that a Thompson digraph is a hammock and it has exactly the same
number of edges as the number of transitions in the original Thompson machine.
The Thompson machine of Fig. 2 yields the Thompson digraph of Fig. 3.

Fig. 3. The Thompson digraph given by the Thompson machine of Fig. 2.

We can obtain a Thompson digraph directly from a regular expression by 
modifying Thompson’s construction appropriately. 



4 The Characterization Theorem 

Not only is a Thompson digraph a hammock, but also the vertices of a Thompson 
digraph have indegree and outdegree at most two. We call hammocks that satisfy 
this additional restriction two hammocks. 

We characterize the two hammocks that are Thompson digraphs. 

We begin with a modification of the usual definition of Dyck strings. For an
integer $t \ge 1$, let $\Gamma_t = \{[_i : 1 \le i \le t\} \cup \{]_i : 1 \le i \le t\}$
and $\hat{\Gamma}_t = \Gamma_t \cup \{\bowtie\}$, where $\bowtie$ is a special symbol used to label
vertices that are of indegree and outdegree one. The notion of Dyck string over
$\Gamma_t$ is well known; such a string is well-balanced with respect to the pairing
of $[_i$ with $]_i$, for all $i$, $1 \le i \le t$. We need to allow the pairing to be given by
any bijection $\beta : \{1, \ldots, t\} \to \{1, \ldots, t\}$, not only by the identity
bijection. We call such strings b-Dyck strings and we define them as follows.
Given an alphabet $\Gamma_t$ and a bijection $\beta : \{1, \ldots, t\} \to \{1, \ldots, t\}$,
we define b-Dyck strings over $\Gamma_t$ with respect to $\beta$ inductively as follows:
The null string $\lambda$ is a b-Dyck string over $\Gamma_t$ and, whenever $x$ and $y$ are
b-Dyck strings over $\Gamma_t$, $xy$ is a b-Dyck string over $\Gamma_t$ and
$[_i\, x\, ]_{\beta(i)}$ is a b-Dyck string over $\Gamma_t$, for all $i$, $1 \le i \le t$.

^ We can establish this observation formally using the inductive construction. If we
allow regular expressions to contain the empty-set symbol, then the digraphs of the
resulting machines are not necessarily hammocks.

We now define a fundamental set of strings for Thompson digraphs. Given an
alphabet $\Gamma_t$ and a bijection $\beta : \{1, \ldots, t\} \to \{1, \ldots, t\}$, a string $x$ over
$\hat{\Gamma}_t$ is a Thompson string with respect to $\beta$ if it satisfies the following three
conditions:

1. If we erase all appearances of $\bowtie$ from $x$, we obtain a b-Dyck string over $\Gamma_t$
with respect to $\beta$.

2. The substring $[_i\,]_j$ does not appear in $x$, for any $i$ and $j$, $1 \le i, j \le t$.

3. All maximal-length $\bowtie$-substrings of $x$ have even length.
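
Two of these conditions are easy to sketch in code: the b-Dyck condition (condition 1) and the even-length condition (condition 3). Condition 2 is a local check on adjacent symbols and is not covered here. The token encoding, the name `JOIN` for the special symbol that labels vertices of indegree and outdegree one, and the function names are assumptions of this sketch.

```python
JOIN = ("x", None)   # stands for the special symbol of (1, 1) vertices


def is_b_dyck(tokens, beta):
    """Check the b-Dyck property with a stack.

    tokens: sequence of ("[", i) and ("]", j) pairs;
    beta:   dict mapping each opening index i to its closing index beta[i].
    """
    stack = []
    for kind, index in tokens:
        if kind == "[":
            stack.append(index)
        elif kind == "]":
            # The most recent unmatched opening bracket must pair with this one.
            if not stack or beta[stack.pop()] != index:
                return False
        else:
            return False          # only brackets are allowed in a b-Dyck string
    return not stack


def erase_joins(tokens):
    """Condition 1 is checked on the string with every JOIN erased."""
    return [t for t in tokens if t != JOIN]


def join_runs_even(tokens):
    """Condition 3: every maximal run of JOIN symbols has even length."""
    run = 0
    for t in tokens:
        if t == JOIN:
            run += 1
        else:
            if run % 2:
                return False
            run = 0
    return run % 2 == 0


beta = {1: 2, 2: 1}
path = [("[", 1), JOIN, JOIN, ("]", 2)]
```

Here `is_b_dyck(erase_joins(path), beta)` holds for the bijection shown, while the identity bijection rejects the same string.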

We develop a characterization of Thompson digraphs by first considering the
Thompson digraphs that are obtained from star-free expressions. In this case,
we obtain Thompson dags. Since Thompson digraphs are two hammocks, we
refer to a vertex with indegree $i$ and outdegree $j$ as an $(i,j)$ vertex. There are
no (2, 2) vertices in Thompson dags. Indeed, (2, 2) vertices are produced only
by the Kleene star of a plus subexpression that has the form $((F + G)^*)$. For
convenience we assume that the source and the sink vertices have a dummy
in-edge and a dummy out-edge, respectively. Then, we have the following necessary
condition for a two-hammock $H$ to be a Thompson dag.

Property 1: A Thompson dag has only (1, 2), (1, 1), and (2, 1) vertices, it has
an even number of (1, 1) vertices and it has as many (1, 2) vertices as it has
(2, 1) vertices.

Definition 1: As a result of Property 1, let the number of (1, 2) vertices (or,
equivalently, the number of (2, 1) vertices) be $t$. We label the (1, 2) vertices
arbitrarily and uniquely with $[_i$, $1 \le i \le t$; the (1, 1) vertices with $\bowtie$; and
the (2, 1) vertices arbitrarily and uniquely with $]_j$, $1 \le j \le t$. The resulting
digraph is a $\hat{\Gamma}_t$-labeled two hammock. If the hammock is a Thompson dag,
then the resulting dag is a $\hat{\Gamma}_t$-labeled Thompson dag.

Note that, if t = 0, then the two hammock is a line digraph. 
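
Property 1 is a pure degree count, so it can be checked in one pass over the edges. The sketch below assumes the digraph is given as a list of vertex pairs together with its source and sink; the function names are illustrative.

```python
from collections import Counter


def degree_profile(edges, source, sink):
    """Map each vertex to (indegree, outdegree), counting the dummy in-edge
    of the source and the dummy out-edge of the sink."""
    indeg, outdeg = Counter(), Counter()
    for u, v in edges:
        outdeg[u] += 1
        indeg[v] += 1
    indeg[source] += 1     # dummy in-edge
    outdeg[sink] += 1      # dummy out-edge
    vertices = set(indeg) | set(outdeg)
    return {v: (indeg[v], outdeg[v]) for v in vertices}


def satisfies_property_1(edges, source, sink):
    kinds = Counter(degree_profile(edges, source, sink).values())
    allowed = {(1, 2), (1, 1), (2, 1)}
    return (set(kinds) <= allowed
            and kinds[(1, 1)] % 2 == 0
            and kinds[(1, 2)] == kinds[(2, 1)])


# A fork at vertex 4, two single-edge branches, and a join at vertex 5.
fork_join = [(4, 0), (0, 1), (1, 5), (4, 2), (2, 3), (3, 5)]
```

The fork-and-join dag has one (1, 2) vertex, one (2, 1) vertex and four (1, 1) vertices, so it passes; a path on three vertices fails because it has an odd number of (1, 1) vertices.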

We can now state a second necessary condition for a two-hammock $H$ to be
a Thompson dag.

Property 2: For each $\hat{\Gamma}_t$-labeled Thompson dag $G$, there is a bijection
$\beta : \{1, \ldots, t\} \to \{1, \ldots, t\}$ such that every source-sink path in $G$ spells out a
Thompson string.

A Thompson string satisfies three conditions, two of which (conditions 2 and 3)
are easy to check directly. We discuss the checking of the b-Dyck condition of
Property 2 when we discuss how to check for Property 3 later in this section.

Lemma 1. Each Thompson dag satisfies Properties 1 and 2. 

Proof. Let $E$ be a star-free and empty-free regular expression such that $H$ is
the Thompson digraph obtained from $E$. We prove the result by induction on
the size of $E$.

Notice that Properties 1 and 2 are not sufficient for an acyclic two hammock
to be a Thompson dag; for example, consider the dag of Fig. 4. Although
all source-sink paths in this dag spell out Thompson strings, the dag cannot
be obtained by applying the Thompson construction to any star-free regular
expression. The reason is that we also need to verify that every path from a
$[_i$-vertex passes through the same matching $]_{\beta(i)}$-vertex. We express this condition
using specific decompositions of b-Dyck strings.

Fig. 4. An example for the insufficiency of Property 2.

Definition 2: Let $H$ be a $\hat{\Gamma}_t$-labeled two hammock and let $p$ be a (1, 2) vertex
in $H$. Then, we define $L_p$ to be the set of strings spelled out on paths from
the source vertex to the sink vertex that pass through $p$.

Property 3: Let $H$ be a $\hat{\Gamma}_t$-labeled Thompson digraph. Then, there is a
bijection $\beta : \{1, \ldots, t\} \to \{1, \ldots, t\}$ such that, for all (1, 2) vertices $p$ in $H$ and
for all strings $x$ in $L_p$, $x$ can be decomposed into $u\,[_i\,v\,]_{\beta(i)}\,w$, where $[_i$ is the
label of $p$.

Lemma 2. Let $H$ be a Thompson dag. Then, $H$ satisfies Property 3.

We are now in a position to characterize Thompson dags. 

Theorem 1. An acyclic two hammock is a Thompson dag if and only if it sat- 
isfies Properties 1, 2 and 3. 



We omit most proofs; they are to be found in the full version of the paper [S].

Proof. We already proved, by Lemmas 1 and 2, that, if $H$ is a Thompson dag,
then it satisfies Properties 1, 2 and 3. We now prove the converse. Let $H$ be
an acyclic two hammock that satisfies Properties 1, 2 and 3. To demonstrate
that $H$ is a Thompson dag, we construct a regular expression $E$ such that $H$ is
isomorphic to $H_E$.





Fig. 5. Three Thompson digraph transformations: a. Replacement of plus unit.
b. Replacement of dot unit. c. Replacement of star unit.



We now characterize the two hammocks that are Thompson digraphs. Let $H$
be a two hammock. We perform a depth-first traversal of $H$ starting from the
source vertex. It determines a set of back edges $B = \{(y_1, x_1), \ldots, (y_k, x_k)\}$,
where $k \ge 0$. Notice that when $(y_i, x_i)$ is a back edge, the vertex $x_i$ has indegree
two. For, if $x_i$ has indegree one, then there is no simple source-sink path that
contains $x_i$. Similarly, $y_i$ has outdegree two. For, if $y_i$ has outdegree one, then
there is no simple source-sink path that contains $y_i$. We now state a necessary
property for a two hammock to be a Thompson digraph.

Property 4: For each back edge $(y, x)$ in a Thompson digraph $H$, there are
two vertices $w$ and $z$ such that $w$, $x$, $y$ and $z$ are distinct and edges $(w, x)$,
$(y, z)$ and $(w, z)$ are in $H$.

Based on Property 4, we define a digraph transformation that removes each
back edge $(y, x)$ and expands each edge $(w, z)$ into three edges. Thus, if a two
hammock has only Thompson-like cycles, the digraph transformation will
transform them to give a dag. We define the star-reduction dag $H'$ as follows: for
each back edge $(y_i, x_i)$, remove the edge $(y_i, x_i)$ and replace the edge $(w_i, z_i)$ with
the three edges $(w_i, u_i)$, $(u_i, v_i)$ and $(v_i, z_i)$, where $u_i$ and $v_i$ are new vertices;
see Fig. 6.




Fig. 6. An illustration of star reduction. 
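
Back-edge detection, the Property 4 check and star reduction can be sketched together. This is a minimal sketch, assuming vertices are integers, the digraph is a two hammock given as a list of edges, and the depth-first traversal is the plain recursive one; the names `back_edges` and `star_reduce` are illustrative.

```python
def back_edges(succ, source):
    """Back edges found by a depth-first traversal from source."""
    state, found = {}, []          # state: 1 = on the DFS stack, 2 = finished

    def visit(u):
        state[u] = 1
        for v in succ.get(u, []):
            if v not in state:
                visit(v)
            elif state[v] == 1:    # v is an ancestor of u: (u, v) is a back edge
                found.append((u, v))
        state[u] = 2

    visit(source)
    return found


def star_reduce(edges, source):
    """Return the edge set after star reduction, or None if Property 4 fails."""
    succ, pred = {}, {}
    for u, v in edges:
        succ.setdefault(u, []).append(v)
        pred.setdefault(v, []).append(u)
    reduced = set(edges)
    new_vertex = max(max(u, v) for u, v in edges) + 1
    for y, x in back_edges(succ, source):
        ws = [w for w in pred[x] if w != y]    # the other predecessor of x
        zs = [z for z in succ[y] if z != x]    # the other successor of y
        if len(ws) != 1 or len(zs) != 1:
            return None
        w, z = ws[0], zs[0]
        if len({w, x, y, z}) != 4 or (w, z) not in reduced:
            return None
        u, v = new_vertex, new_vertex + 1      # the two new vertices
        new_vertex += 2
        reduced -= {(y, x), (w, z)}
        reduced |= {(w, u), (u, v), (v, z)}
    return reduced


# A single loop with bypass: back edge (2, 1), with w = 0 and z = 3.
loop = [(0, 1), (1, 2), (2, 3), (2, 1), (0, 3)]
```

For the `loop` digraph the back edge (2, 1) is removed and the bypass edge (0, 3) becomes the path 0, 4, 5, 3; a loop without the bypass edge is rejected.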



Theorem 2. A two-hammock $H$ is a Thompson digraph if and only if $H$ satisfies
Property 4 and $H'$ is a Thompson dag.

Proof. It is easy to verify that, by definition, if $H$ is a Thompson digraph, then
each back edge has corresponding forward edges as specified in Property 4 and
$H'$ is a Thompson dag.

Conversely, suppose that we have a two-hammock $H$ that satisfies Property 4
and $H'$ is a Thompson dag. We can construct a regular expression $E'$ corresponding
to $H'$ as in the proof of Theorem 1. Then, we can obtain a regular expression
$E$ for $H$ from $E'$ by replacing subexpressions of the form $(F + \lambda)$, produced by
star reduction, with $(F^*)$. We omit the formal inductive proof that $H_E = H$ as
it is similar to the proof of Theorem 1.

Given a digraph $H$, we can verify whether it is a two hammock in time linear
in the size of $H$. Moreover, we can check whether it is a Thompson digraph in time
linear in the size of $H$ in two steps as follows: First, using a depth-first traversal
of $H$ from the source vertex, we detect all back edges and then check whether
Property 4 holds. Second, if Property 4 holds, then we apply star reduction to
$H$ to give a dag $H'$ and then check by depth-first traversal whether $H'$ is a
Thompson dag. We give a more formal description of the nontrivial portion of
a Thompson-dag recognition algorithm elsewhere 0 .

5 Regular Expressions from Thompson Machines 

Once we have confirmed that a given edge-labeled digraph is an edge-labeled 
Thompson digraph, we can construct a regular expression from it whose size is 
linear in the size of the given machine and we can do so in linear time. 

Theorem 3. Given a Thompson machine $M$, we can construct an equivalent
regular expression from it in time linear in the size of $M$.

Proof. The idea behind the proof of this result is that we, first, parse the
underlying edge-labeled Thompson digraph using a depth-first traversal to obtain
an expression tree and, second, perform a depth-first traversal of the expression
tree to produce a correctly parenthesized regular expression. Clearly, the second
step is straightforward and can be implemented to run in time linear in the size
of the expression tree. We claim that the size of the expression tree is of the same
order as the size of the Thompson digraph. We can construct a parsing algorithm
for Thompson digraphs that takes time linear in the size of a digraph |S|. The
algorithm assumes that for each starting vertex of a unit, including a base unit,
the $\beta$ map gives the corresponding ending vertex.
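
The second step, turning an expression tree into a fully parenthesized expression, can be sketched as follows. The tuple encoding of the tree and the function names are assumptions of this sketch; the pieces are collected in a list and joined once so that the work stays linear in the size of the tree.

```python
def emit(tree, out):
    """Append the pieces of the parenthesized expression for tree to out."""
    kind = tree[0]
    if kind == "sym":
        out.append(tree[1])
    elif kind == "lambda":
        out.append("λ")
    elif kind == "star":
        out.append("(")
        emit(tree[1], out)
        out.append("*)")
    elif kind in ("plus", "dot"):
        out.append("(")
        emit(tree[1], out)
        out.append(" + " if kind == "plus" else " · ")
        emit(tree[2], out)
        out.append(")")
    else:
        raise ValueError("unknown node kind: %r" % (kind,))


def to_regex(tree):
    """Depth-first traversal producing a correctly parenthesized expression."""
    out = []
    emit(tree, out)
    return "".join(out)


tree = ("dot",
        ("star", ("plus", ("sym", "a"), ("sym", "b"))),
        ("dot", ("plus", ("sym", "b"), ("lambda",)), ("sym", "a")))
```

For the tree shown, `to_regex(tree)` gives `(((a + b)*) · ((b + λ) · a))`.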

It is easy to verify that each edge of $H$ is traversed exactly once by the
parsing algorithm; therefore, it takes time linear in the size of $H$. Moreover, its
correctness follows because $H$ satisfies Properties 1, 2 and 3.

The parsing algorithm for Thompson digraphs is, essentially, the inverse of
the Thompson construction. For, if we take a regular expression $E$, construct the
Thompson machine $M_E$ and then apply the parsing algorithm to $M_E$, we obtain
a regular expression $E'$ that is equivalent to $E$. Note that, although $E$ and $E'$ are
equivalent, $E'$ may be differently parenthesized and the order of subexpressions
of plus units may be different.

6 Concluding Remarks 

We have established a characterization of Thompson digraphs that enables us 
to unambiguously parse such digraphs and reconstruct regular expressions from 
them that have the same sizes as their digraphs. The interesting fact is that we ig- 
nore the transition labels completely in the characterization. The earlier work of 
Caron and Ziadi 0 gives a second class of machines for which the construction of 
equivalent regular expressions can be carried out efficiently. Can similar charac- 
terizations be established for other inductively constructed finite-state machines? 
We conjecture that such results hold for Mirkin’s construction and the SSS 
construction jjj. In any case, we conjecture that the finite-state machines given 
by these constructions can be unambiguously parsed. One interesting problem is 
whether we can unambiguously parse the machines given by other constructions 
in the literature including Kleene’s original construction. 

A tantalizing open problem is to characterize the largest class of finite-state 
machines that have small expressions easily computable from the machines. A 
less ambitious goal is to identify nontrivial classes of finite-state machines that 
yield small expressions.

Abstract. Finite automata are used for the encoding and compression
of images. For black-and-white images, for instance, using the quad-tree
representation, the black points correspond to $\omega$-words defining the
corresponding paths in the tree that lead to them. If the $\omega$-language consisting
of the set of all these words is accepted by a deterministic finite automaton
then the image is said to be encodable as a finite automaton. For
grey-level images and colour images similar representations by automata
are in use.

In this paper we address the question of which images can be encoded
as finite automata with full infinite precision. In applications, of course,
the image would be given and rendered at some finite resolution - this
amounts to considering a set of finite prefixes of the $\omega$-language - and
the features in the image would be approximations of the features in the
infinite precision rendering.

We focus on the case of black-and-white images - geometrical figures,
to be precise - but treat this case in a d-dimensional setting, where d is
any positive integer. We show that among all polygons in d-dimensional
space those with rational corner points are encodable as finite automata.
In the course of proving this we show that the set of images encodable
as finite automata is closed under rational affine transformations.
Several simple properties of images encodable as finite automata are
consequences of this result. Finally we show that many simple geometric
figures such as circles and parabolas are not encodable as finite automata.



1 Introduction 

Finite automata are widely used as a means for describing certain fractals (see P] 
EI3). Usually, the investigation of automaton-generated fractals starts from the 
underlying automaton and aims at a description of the image or the calculation
of some of its parameters like density, dimension or measure (see [tUdj ) . Less
is known about the converse direction, that is, starting from a class of images
to ask whether they are generated by automata or, if so, to describe these
automata. Some structural properties of images generated by finite automata can
be derived from the structure of the $\omega$-languages accepted by the automata.
Finite-automaton generated images turn out to have specific shapes (see e. g. P
□ ).

* The research reported in this paper was partially supported by the Natural Sciences
and Engineering Research Council of Canada, Grant OGP0000243. After completion
of this paper we succeeded in strengthening some of the results. A full version of this
paper, which includes all proofs and the recent results, is in preparation j^.

We focus on d-dimensional black-and-white images. Using their representation
as infinite (ordered) trees with a branching of up to $2^d$ - in the case of $d = 2$
these are quad-trees - the black points correspond to the infinite branches in
these trees. Hence an image would be represented by the $\omega$-language describing
these branches. An image is encodable as (or definable by) a finite automaton if
its $\omega$-language is accepted by that automaton, that is, if that $\omega$-language is
regular (see nm). The cases of grey-level or colour images would require additional
parameters.

The encoding of an image as an automaton represents the image at an infi- 
nite resolution. Sampling or rendering the image at a bounded resolution corre- 
sponds to running the automaton for a bounded time only. These connections 
are exploited, for example, in an automaton-based image compression procedure 
(see H). 
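
The link between bounded resolution and bounded running time can be illustrated with a small hand-made example, which is not taken from the paper. Over pairs of binary digits, the automaton below keeps exactly the words whose first coordinate is lexicographically at most the second, which encodes the triangle $x \le y$ in the unit square. A pixel addressed by a finite word is drawn black as long as the automaton has not entered its dead state. All names are illustrative.

```python
def step(state, symbol):
    """One transition of the comparison automaton.

    States: "eq" (coordinates agree so far), "lt" (x < y already decided),
    "dead" (x > y already decided, no accepted continuation).
    """
    a, b = symbol
    if state == "eq":
        if a == b:
            return "eq"
        return "lt" if a < b else "dead"
    return state


def is_black(word):
    """Run the automaton for len(word) steps only."""
    state = "eq"
    for symbol in word:
        state = step(state, symbol)
    return state != "dead"


def address(i, j, n):
    """Length-n word over pairs of binary digits addressing pixel (i, j)."""
    return [((i >> k) & 1, (j >> k) & 1) for k in range(n - 1, -1, -1)]


def render(n):
    """Image at resolution 2**n by 2**n; rows are indexed by j, columns by i."""
    size = 2 ** n
    return [[is_black(address(i, j, n)) for i in range(size)] for j in range(size)]


image = render(3)
```

At resolution 8 by 8, pixel (i, j) comes out black exactly when i is at most j; increasing `n` refines the same picture without changing the automaton.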

We address the question of which images are encodable as finite automata. 
In particular, we consider polygons and simplexes in d-dimensional Euclidean 
space, that is, convex hulls of finite sets of points. 

The main theorems of this paper state that a d-dimensional simplex is de- 
finable by a finite automaton if it is the convex hull of a finite set of points 
with rational coordinates, and a polygon is definable by a finite automaton if 
and only if its corner points are rational. This result is independent of the base 
chosen for the number representation. The set of images definable by finite
automata being closed under union, projection, inverse projection and, essentially,
also difference, it turns out that the class of geometrical figures definable by
finite automata is quite rich.

One of the main tools for proving this result is the following property of
images encodable as finite automata: The set of these images is closed under
rational affine transformations, that is, transformations of the form
$\mathbf{y} = A\mathbf{x} + \mathbf{b}$
with only rational numbers as entries of the transformation matrix $A$ and the
translation vector $\mathbf{b}$.

From closure properties of the set of regular $\omega$-languages and these results,
one can determine further interesting classes of simple geometrical figures en- 
codable as finite automata. On the other hand, some very simple geometrical 
figures like circles or parabolas cannot be encoded as finite automata. For im- 
age compression by automata this implies that such figures will, of necessity, be 
approximated by simplexes sampled at some bounded resolution. 



^ We consider figures that are bounded and closed in Euclidean space. Therefore, 
difference here means the closure of the set theoretical difference.

2 Notation

The symbols $\mathbb{N}$, $\mathbb{Z}$, $\mathbb{Q}$ and $\mathbb{R}$ denote the sets of non-negative integers, integers,
rational and real numbers, respectively. An alphabet is a finite and non-empty
set. For an alphabet $X$, $X^*$ and $X^\omega$ denote the sets of finite and right-infinite
words over $X$, respectively. For a word $w \in X^*$, $|w|$ is its length. Right-infinite
words are referred to as $\omega$-words in the sequel. An $\omega$-language is a set of $\omega$-words.
An $\omega$-language is regular if it is accepted by a finite automaton (see PO] for the
relevant background and references).

For any alphabet $Y$ and any positive integer $d$, let $[Y,d]$ denote the $d$-fold
Cartesian product

$$[Y,d] = \underbrace{Y \times \cdots \times Y}_{d \text{ times}}.$$

For $y = (y_1, \ldots, y_d) \in [Y,d]$ and an integer $i$ with $1 \le i \le d$, the $i$-th projection
of $y$ is $\mathrm{proj}_i\, y = y_i$.

For the representation of real numbers, we fix a base $r \in \mathbb{N}$ with $r \ge 2$.
Then the set $Y = \{0, 1, \ldots, r - 1\}$ is considered as the set of $r$-ary number
symbols. Every real number in the closed interval $[0,1] = \{x \mid 0 \le x \le 1\}$ has
a base-$r$ representation of the form $0.\alpha$ where $\alpha \in Y^\omega$. In particular, a finite
representation of a rational number can be padded by an infinite sequence of
the symbol 0. Conversely, every $\omega$-word $\alpha$ over $Y$ denotes a unique real number
$\nu_r(\alpha)$ in the interval $[0,1]$, represented by $0.\alpha$. It is well-known that the mapping
from representations of numbers to their values is not injective.
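
The lack of injectivity is easy to see on eventually periodic expansions, whose values can be computed exactly. The sketch below assumes an $\omega$-word is given as a finite prefix followed by a non-empty period repeated forever; the function name `value` is illustrative.

```python
from fractions import Fraction


def value(prefix, period, r):
    """Value of 0.prefix period period ... in base r (period must be non-empty)."""

    def finite(digits):
        # Value of the finite expansion 0.d1 d2 ... dn in base r.
        return sum(Fraction(d, r ** (k + 1)) for k, d in enumerate(digits))

    scale = Fraction(1, r ** len(prefix))
    # Geometric series: the period contributes finite(period) / (1 - r**-len(period)).
    repeat = finite(period) / (1 - Fraction(1, r ** len(period)))
    return finite(prefix) + scale * repeat
```

In base 2, `value([1], [0], 2)` and `value([0], [1], 2)` are both equal to 1/2, so the words 1000... and 0111... denote the same number.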

Let $d$ be a positive integer. To specify points in the closed $d$-dimensional unit
cube $[0,1]^d$ we use $\omega$-words over the alphabet $X = [Y,d]$. For $\xi = x_1 x_2 \cdots \in X^\omega$
and an integer $i$ with $1 \le i \le d$, the $i$-th projection of $\xi$ is the $\omega$-word

$$\mathrm{proj}_i\, \xi = \mathrm{proj}_i\, x_1\ \mathrm{proj}_i\, x_2 \cdots$$

obtained from the $i$-th projections of the symbols of $\xi$. The point $\nu_r(\xi)$ in $[0,1]^d$
defined by $\xi$ has, as coordinates, the values of the numbers represented by the
projections of $\xi$.

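We generalize this concept of projection to multiple coordinates. Consider
$y = (y_1, \ldots, y_d) \in X = [Y,d]$, $k \in \mathbb{N}$, $k > 0$, and a $k$-tuple $\mathbf{i} = (i_1, \ldots, i_k)$ of
integers in $\{1, \ldots, d\}$. Then $\mathrm{proj}_{\mathbf{i}}\, y = (y_{i_1}, \ldots, y_{i_k}) \in [Y,k]$. For
$\xi = x_1 x_2 \cdots \in X^\omega$, the projection $\mathrm{proj}_{\mathbf{i}}\, \xi$ is the $\omega$-word

$$\mathrm{proj}_{\mathbf{i}}\, x_1\ \mathrm{proj}_{\mathbf{i}}\, x_2 \cdots$$

in $[Y,k]^\omega$ and its value $\nu_r(\mathrm{proj}_{\mathbf{i}}\, \xi)$ is a point in $[0,1]^k$. Let $\mathrm{pr}_{\mathbf{i}}$ denote the
corresponding projection of $[0,1]^d$ into $[0,1]^k$. The following diagram is commutative.

$$\begin{array}{ccc}
[Y,d]^\omega & \xrightarrow{\ \mathrm{proj}_{\mathbf{i}}\ } & [Y,k]^\omega \\
\downarrow{\scriptstyle \nu_r} & & \downarrow{\scriptstyle \nu_r} \\
[0,1]^d & \xrightarrow{\ \mathrm{pr}_{\mathbf{i}}\ } & [0,1]^k
\end{array}$$

The mapping $\mathrm{proj}_{\mathbf{i}}$ and its inverse preserve regularity of $\omega$-languages.

Let $\eta_1, \ldots, \eta_d \in Y^\omega$. By slight abuse of notation we write $(\eta_1, \ldots, \eta_d)$ to
denote the $\omega$-word $\xi \in [Y,d]^\omega$ such that $\mathrm{proj}_i\, \xi = \eta_i$ for $i = 1, \ldots, d$.

On $X^\omega$ one defines an ultrametric $\varrho$ by

$$\varrho(\zeta, \xi) = \inf\{ r^{-|w|} \mid w \text{ is a common prefix of } \zeta \text{ and } \xi \}.$$

Since $X$ is finite, the space $(X^\omega, \varrho)$ is a compact metric space. Moreover, the
mapping $\nu_r$ of $X^\omega$ onto $[0,1]^d$ is continuous.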

3 Rational Affine Transformations

Consider a function $\varphi : \mathbb{R}^d \to \mathbb{R}$ and an $\omega$-language $F$ over $X$. The function $\varphi$ is
said to describe the $\omega$-language $F$ if $F$ is the largest $\omega$-language such that $\nu_r(F)$
is the set of all solutions in $[0,1]^d$ of the equation

$$\varphi(x_1, \ldots, x_d) = 0,$$

that is,

$$F = \nu_r^{-1}\bigl(\{(x_1, \ldots, x_d) \mid \varphi(x_1, \ldots, x_d) = 0,\ 0 \le x_i \le 1 \text{ for } i = 1, \ldots, d\}\bigr).$$

We write $F_\varphi$ to denote the $\omega$-language described by $\varphi$. The set $F_\varphi$ contains all
base-$r$ representations of all solutions in $[0,1]^d$ of the equation above.

The following lemma plays a fundamental role in some of the proofs:

Lemma 1. Let $\varphi : \mathbb{R}^d \to \mathbb{R}$ be a function, $c_i \in \{-1, +1\}$, $1 \le i \le d$, and $c \in \mathbb{Z}$
such that

$$\varphi(x_1, \ldots, x_d) = c_1 x_1 + \cdots + c_d x_d + c.$$

Then $F_\varphi$ is regular and closed.

We list a few immediate consequences. As is well-known, every rational
number of the form $k/r^n$ has two base-$r$ representations. Thus a point in
$d$-dimensional space $\mathbb{R}^d$ may have up to $2^d$ representations. A typical complication
arises from the fact that, due to those multiple representations, for $F, F' \subseteq X^\omega$,
the sets $\nu_r(F) \cap \nu_r(F')$ and $\nu_r(F \cap F')$ might not be equal. For example, with
$d = 1$, $r = 2$, $F = \{1000\cdots\}$ and $F' = \{01111\cdots\}$ one has $\nu_r(F) = \nu_r(F') = \{\frac{1}{2}\}$
whereas $\nu_r(F \cap F') = \emptyset$. However, for any $F, F' \subseteq X^\omega$, one has

$$\nu_r(F) \cap \nu_r(F') = \nu_r\bigl(\nu_r^{-1}(\nu_r(F)) \cap F'\bigr).$$

One is, therefore, led to work with full representations, that is, with $\omega$-languages
$F$ satisfying $F = \nu_r^{-1}(\nu_r(F))$.

Proposition 1 The uj-languages 

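$$E^{(2d)} = \{\xi \mid \xi \in [Y,2d]^\omega \text{ with } \nu_r(\mathrm{proj}_i\, \xi) = \nu_r(\mathrm{proj}_{i+d}\, \xi) \text{ for } i = 1, \ldots, d\}$$

and

$$\{\xi \mid \xi \in [Y,m+1]^\omega \text{ with } \nu_r(\mathrm{proj}_i\, \xi) = \nu_r(\mathrm{proj}_{i+1}\, \xi) \text{ for } i = 1, \ldots, m\}$$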

are regular.

As a consequence, moving from a regular representation to the corresponding
full representation preserves regularity.

Proposition 2 Let $F$ be an $\omega$-language over $X = [Y,d]$. If $F$ is regular then
also $\nu_r^{-1}(\nu_r(F))$ is regular.

We now include integer and, consequently, also rational coefficients.

Proposition 3 Consider a function $\varphi : \mathbb{R}^2 \to \mathbb{R}$ such that $\varphi(x_1, x_2) = x_1 - m x_2$
for some $m \in \mathbb{Z}$. Then $F_\varphi$ is regular.

Exploiting the proof techniques of the preceding propositions, one can extend
Lemma 1 to rational coefficients.

Lemma 2. Let $\varphi : \mathbb{R}^d \to \mathbb{R}$ be a function, $c_i, c \in \mathbb{Z}$, $1 \le i \le d$, such that

$$\varphi(x_1, \ldots, x_d) = c_1 x_1 + \cdots + c_d x_d + c.$$

Then $F_\varphi$ is regular and closed.

An affine transformation of $\mathbb{R}^d$ into $\mathbb{R}^k$ is given by an equation of the form
$\mathbf{y} = A\mathbf{x} + \mathbf{b}$ where $\mathbf{y}$ and $\mathbf{b}$ are $1 \times k$-vectors, $\mathbf{x}$ is a $1 \times d$-vector and $A$ is a
$k \times d$-matrix. An affine transformation is said to be rational if the entries of $A$
and $\mathbf{b}$ are rational.
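
A rational affine transformation can be applied exactly with rational arithmetic, which makes it visible that rational points are mapped to rational points. The sketch below is only an illustration of the definition, not of the proof technique; the matrix, the vector and the function name `affine` are made up for the example.

```python
from fractions import Fraction


def affine(A, b, x):
    """Apply y = A x + b exactly; A is a list of k rows, each of length d."""
    return [sum(Fraction(a) * Fraction(xi) for a, xi in zip(row, x)) + Fraction(bi)
            for row, bi in zip(A, b)]


# An illustrative rational map of the plane into itself.
A = [[Fraction(1, 2), 0],
     [Fraction(1, 4), Fraction(1, 2)]]
b = [0, Fraction(1, 4)]

# Corner points of a rational triangle and their images.
corners = [(0, 0), (1, 0), (0, 1)]
images = [affine(A, b, p) for p in corners]
```

The three rational corners are mapped to (0, 1/4), (1/2, 1/2) and (0, 3/4), which are again rational points of the unit square.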

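Theorem 1. Let $\Psi : \mathbb{R}^d \to \mathbb{R}^k$ be a rational affine transformation and let
$\Gamma(\Psi) \subseteq \mathbb{R}^{d+k}$ be its graph. Then the $\omega$-language
$F_\Psi = \nu_r^{-1}(\Gamma(\Psi) \cap [0,1]^{d+k})$ is regular.

From Theorem 1 one concludes that rational affine transformations and their
inverses preserve regularity.

Theorem 2. Let $\Phi : \mathbb{R}^d \to \mathbb{R}^k$ and $\Psi : \mathbb{R}^k \to \mathbb{R}^d$ be rational affine
transformations and let $F \subseteq X^\omega$ be regular. Then both

$$\nu_r^{-1}\bigl(\Phi(\nu_r(F)) \cap [0,1]^k\bigr) \quad \text{and} \quad \nu_r^{-1}\bigl(\Psi^{-1}(\nu_r(F)) \cap [0,1]^k\bigr)$$

are regular $\omega$-languages.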

4 Simple Geometric Figures 

A point in $[0,1]^d$ is said to be rational if all its coordinates are rational; it is said
to be nearly rational if at least $d - 1$ of its coordinates are rational. A simplex in
$[0,1]^d$ is the convex hull of a finite set of points in $[0,1]^d$. A simplex in $[0,1]^d$ is
said to be rational if it is the convex hull of finitely many rational points. Using
Theorem 2 one obtains a sufficient condition on the encodability of simplexes
as finite automata.

Theorem 3. A simplex in $[0,1]^d$ is encodable as a finite automaton if it is
rational.

Using the closure properties of regular $\omega$-languages and taking into account that
we need to work with full representations, one realizes that the set of simple geometric
figures definable by finite automata is quite large.

Since the closure and the boundary of a regular $\omega$-language are again regular,
we obtain the following closure property of the family of images definable by
finite automata.

Proposition 4 Let $M \subseteq [0,1]^d$ be encodable as a finite automaton. Then both
the closure $\overline{M}$ and the boundary $\partial M$ of $M$ are encodable as finite automata.

We obtain a characterization of polygons encodable as finite automata. 

Theorem 4. A polygon in $[0,1]^d$ is encodable as a finite automaton if and only
if its corner points are rational.

In the course of proving Theorem 4 one derives several criteria for the
encodability of line sets in the unit interval as automata.

Proposition 5 Let $M \subseteq [0,1]^d$ be encodable as a finite automaton. If $M$ is
non-empty then it contains a rational point. If $M$ is countable, then all points
in $M$ are rational.

Lemma 3. Let $I$ be a finite or denumerable index set and, for $i \in I$, let $a_i, b_i \in [0,1]$
with $a_i \le b_i$ and let $R_i$ be an interval of the form $(a_i, b_i)$, $[a_i, b_i)$, $(a_i, b_i]$, assuming
$a_i \ne b_i$, or $[a_i, b_i]$. If, for $i, j \in I$ with $i \ne j$, the intervals $R_i$ and $R_j$ are disjoint
and the set $M = \bigcup_{i \in I} R_i$ is encodable as a finite automaton then $a_i$ and $b_i$ are
rational for all $i \in I$.



5 Images That Are Not Encodable as Finite Automata 

There are many images that are not encodable as finite automata. Proposition 5
states a necessary condition for an image to be encodable as a finite automaton.
Here we state and apply other necessary conditions.

Proposition 6 If a smooth non-constant curve $M \subseteq [0,1]^d$ is encodable as a
finite automaton then every nearly rational point on the curve is rational.

From Proposition 6 one finds many simple examples of (two-dimensional)
images not encodable as finite automata, for instance:

Example 1 The parabola $f(a) = a^2$ with $0 \le a \le 1$ is not encodable as a finite
automaton because it contains the non-rational point $(1/\sqrt{2}, 1/2)$ which has one
irrational and one rational coordinate.

The next example uses also Theorem 2 in order to prove the nonencodability.

^ We are grateful to one of the referees for providing us with this simple instructive
example.

Example 2 Consider the hyperbola $g(x) = 1/(x + 1)$. Every point in $\Gamma(g)$ with
one rational coordinate is rational. Now transform $\Gamma(g)$ via the rational affine
mapping given by $A = \begin{pmatrix} 1 & 0 \\ 1 & 1 \end{pmatrix}$ and b . The image is $\Gamma(g')$ where
g'ix) = /{I + x) which contains the point ( 5 ) with exactly one rational
coordinate. Thus $g$ is not encodable as a finite automaton.



Another necessary condition is implicit in the following lemma. 

Lemma 4. Let $f : [0,1] \to [0,1]$ be a continuous function, differentiable at
a point $a_0 \in [0,1]$ for which $f'(a_0)$ is irrational. Then the graph $\Gamma(f)$ is not
encodable as a finite automaton.

For the proof one uses the following zoom-in property of images encodable as
finite automata.

For $F \subseteq X^\omega$ and $w \in X^*$, let $F/w = \{\xi \mid w\xi \in F\}$. The set $\{F/w \mid w \in X^*\}$
is finite for regular $F$. The converse is not true in general; see Pj for details. As
a consequence, the number of different images obtainable as zoom-ins is finite if
the image itself is encodable as a finite automaton.

Corollary 1 Let $f : [0,1] \to [0,1]$ be a continuously differentiable function with
a non-constant derivative. Then the graph $\Gamma(f)$ is not encodable as a finite
automaton.

This corollary explains, in addition to Examples 1 and 2, also the following
example.

Example 3 No circle is encodable as a finite automaton. 
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honey for the same expenditure of material. 
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Mathematical Collection 



Abstract. This paper presents an eclectic approach for compressing 
weighted finite-state automata and transducers, with minimal impact on 
performance. The approach is eclectic in the sense that various com- 
plementary methods have been employed: row-indexed storage of sparse 
matrices, dictionary compression, bit manipulation, and lossless omission 
of data. The compression rate is over 83% with respect to the current 
Bell Labs finite-state library. 



1 Introduction 

Regular languages are the least expressive among the family of formal languages; 
hence, their computational counterparts, viz. finite-state automata (FSAs), re- 
quire the least computational power in terms of space and time complexity. 
This, as well as other advantages discussed below, makes FSAs very attractive 
for solving problems in a wide range of computational domains, including switch- 
ing theory, testing circuits, pattern matching, speech and handwriting recogni- 
tion, optical character recognition, encryption, data compression and indexing, 
not to mention a wide range of problems in computational linguistics. (For a 
recent collection of papers on language processing using automata theory, see 
(Roche and Schabes, 1997).) 

There are other advantages to using FSAs, as well as finite-state transducers
(FSTs). Firstly, they are easy to implement. A simple automaton can be
represented by a matrix, though this is not necessarily the best solution
space-wise as we shall see. Secondly, they are fast to traverse, especially in the case
of deterministic devices. Thirdly, and maybe most importantly, transducers
are bidirectional, e.g., a letter-to-sound rule compiled into an FST becomes a
sound-to-letter rule by merely inverting the machine. Finally, FSAs and FSTs
are mathematically elegant since they are closed under various useful operations.
FSAs are closed under concatenation, union, intersection, difference and Kleene
star. FSTs are closed under the same operations (except intersection and
difference, under which only the subclass of $\epsilon$-free FSTs is closed; $\epsilon$ represents the
empty string). Transducers are also closed under composition.

Despite all these niceties, large scale implementations of speech and language
problems using this technology result in notoriously large machines, posing
serious obstacles in terms of time and space complexity. When a finite-state network
becomes unmanageable, even with current computer machinery, the original
network is usually divided into smaller modules that can be put together at run-time
using operations under which FSAs and FSTs are closed. Even with this ‘trick’,
the modules themselves still require massive storage. Table 1 gives the number
of modules, and their storage size in megabytes, for the text analysis
components of the Bell Labs multi-lingual text-to-speech (TTS) system. The size of
these modules becomes a serious hurdle when the system is to be imported into
a special-purpose hardware device with limited memory.



Table 1. Storage requirement for letter-to-sound rules in Bell Labs multilingual TTS
system

Language    Modules    Size (MB)
German      49         26.6
French      52         30.0
Mandarin    51         39.0


This work presents an eclectic approach for compressing FSAs and FSTs, 
with minimal impact, if any, on performance when a caching mechanism is acti- 
vated. The approach adopted here is eclectic in the sense that various comple- 
mentary methods have been employed: row-indexed storage of sparse matrices, 
dictionary compression, bit manipulation, and lossless omission of data. 

The implementation described herein builds on an object-oriented finite-state
machinery library, designed originally by M. Riley and F. Pereira.¹ Data
provided throughout the paper are based on weighted finite-state transducers used
in the language analysis components of the Bell Labs multilingual TTS system
(Sproat, 1997).

Section 2 discusses the characteristics of automata and transducers used in
language and speech applications. Section 3 presents the compression approach
employed here. Section 4 gives further eclectic compression methods. Finally,
section 5 gives some results and brief concluding remarks.

¹ This library was implemented at AT&T Bell Labs. After the trivestiture of the
company in 1996, the library was inherited, and developed further independently, 
by both AT&T Research (Mohri, Pereira, and Riley, 1998) and Lucent Technologies’ 
Bell Labs. The AT&T version is available online at 

http://www.research.att.com/sw/tools/fsm [URL checked July 1, 1999]. 
2 On Speech and Language Transducers

Before embarking on the task at hand, it is crucial to study some of the charac- 
teristics of speech and language FSTs. 

2.1 Lexical vs. Rule Machines 

Speech and language automata tend to be of two types. Lexical-based automata
describe a set of words, morphemes, etc. Fig. 1 gives an automaton for a small
English lexicon representing the words: /book/, /hook/, /move/, and /model/.
Final states mark the end of a lexical entry. Note that entries which share prefixes 
(in the formal sense), such as “mo” in /move/ and /model/, share the same 
transitions for the prefix. The same holds for suffixes. 




Fig. 1. Lexical representation by automata. 



Rule-based FSTs represent a transformational rule. By way of illustration,
consider the derivation of /moving/ from the lexical morphemes /move/ and
/ing/. Note that the [e] in /move/ is deleted once the two morphemes join.
This final-e-deletion rule takes place following a consonant ([v] in this case) and
preceding the suffix /ing/. The FST representation of this rule is depicted in
Fig. 2. The transition *:* from a particular state represents any mapping other
than those explicitly shown to be leaving that state. Given a particular input, 
the transducer remains in state 0 until it scans a ‘v’ on its input tape, upon 
which it will output ‘v’ and move to state 1. Once in state 1, the left context of 
the rule has been detected. 




Fig. 2. Transducer for deleting ‘e’ of /move/ in /moving/. Unless otherwise specified,
the transition *:* stands for a transition on all other symbols not specified in the
respective state. ‘Eps’ represents ε.



When in state 1 and faced with an ‘e’ (the symbol in question for this rule), 
the transducer has two options: (i) to map the ‘e’ to an ε, i.e. apply the rule, and
move to state 3, or (ii) retain the ‘e’ on the output and move to state 2. In the 
former case, one expects to see the right context since the rule was applied; the 
transition on i:i to state 5 and the subsequent transition back to state 0 fulfill 
this task. In the latter case, one expects to see anything except the suffix /ing/. 
Hence, it is not possible to scan /ing/ from state 2. 

In large-scale applications, lexical-based FSTs tend to be much larger than
rule-based ones. They also tend to be more sparse (see §2.3). This observation
has an impact on the compression results that can be achieved.



2.2 Parallel vs. Serial Architecture 

There are two models for putting FSTs together in a regular grammar 
environment: parallel and serial. In the parallel model, each regular rule 
(or lexicon) is compiled into an FST using a compiler similar to that of 
(Karttunen and Beesley, 1992). The entire grammar is then taken as the
intersection of all such FSTs. Fig. 3(a) illustrates this architecture. Note that since
ε-containing FSTs are not closed under intersection, this model is only valid with
ε-free grammars.




Fig. 3. Parallel (a) and serial (b) architectures for putting FSTs together.



The serial model, which is used to build the FSTs described here, allows
for ε-containing rules. As in the parallel model, each rule (or lexicon) is
compiled into an FST, albeit using a different algorithm (Kaplan and Kay, 1994;
Mohri and Sproat, 1996). Under this model, the entire grammar is represented
by the serial composition of all such FSTs. Fig. 3(b) illustrates this approach.
In both models intermediate machines tend to blow up in size.
The language analysis FSTs at hand represent hundreds of rules as well
as various lexical transducers. Composing all rules and lexica into one FST at
compile-time is not feasible space-wise since intermediate machines explode in
size. Conversely, composing the input with hundreds of FSTs at run-time
is expensive time-wise. A middle solution, which is used
in the Bell Labs TTS architecture, is to divide rules and lexica into subsets
and compile the members of each subset into a module. The input can then be
composed with these modules at run-time. Hence, the size of the intermediate
machines will always be on the order of the input size. This is of course possible
because composition is associative.



2.3 Sparsity 

Let A = (Q, Σ, q0, δ, F) be a (nondeterministic) finite-state automaton where Q
is a finite set of states, Σ is a finite set of symbols representing the alphabet,
q0 ∈ Q is the initial state, δ: Q × Σ → 2^Q (where 2^Q denotes the power set of Q)
is the transition function, and F ⊆ Q is a set of final states. The sparsity rate
of A is

$$\mathrm{Sparsity}(A) = 1 - \frac{|\delta|}{|Q| \times |\Sigma|} \qquad (1)$$

where |δ| is the number of non-empty entries of the transition matrix.

Our language analysis machines show that FSTs that represent natural
language are highly sparse, with lexical FSTs being more sparse than rule-based
ones. For instance, a lexical transducer of 36,668 English words, with an
alphabet size of 112 symbols (including orthographic, phonetic and other
grammatical labels) exhibits a sparsity of 98.48%. The English orthographic rules of
(Ritchie et al., 1992, §D.1) produce a machine with a sparsity of 93.16%. The
sparsity of the stress rules in (Halle and Keyser, 1971, p. 10) is 40.84%.
Languages with larger alphabets tend to be considerably more sparse. One of our
Mandarin FSTs, for example, has a sparsity rate of 99.96% due to the fact that
Mandarin employs an inventory of more than 17,000 symbols.
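
As a small illustration of the sparsity rate defined in (1), the following Python sketch computes it for a toy machine. The function name, the dictionary layout of the transition function, and the toy data are assumptions made for this example only; they are not taken from the library described in the paper.

```python
def sparsity(states, alphabet, delta):
    """Return 1 - |delta| / (|Q| * |Sigma|).

    delta maps (state, symbol) to a set of destination states; only
    non-empty cells of the transition matrix are counted.
    """
    filled = sum(1 for targets in delta.values() if targets)
    return 1.0 - filled / (len(states) * len(alphabet))


# Hypothetical toy machine: 4 states, 5 symbols, 3 non-empty cells.
toy_states = {0, 1, 2, 3}
toy_alphabet = set("abcde")
toy_delta = {(0, "a"): {1}, (1, "b"): {2}, (2, "c"): {3}}

toy_rate = sparsity(toy_states, toy_alphabet, toy_delta)
print(f"sparsity = {toy_rate:.2%}")
```

With 3 filled cells out of 20, the toy machine is 85% sparse; the machines reported above are far sparser because their alphabets are much larger than the number of arcs leaving a typical state.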

Many mathematical models that give rise to sparse data result in uniform 
patterns of data. Unfortunately, the sparse data that result from natural lan- 
guage FSTs lack uniformity in their pattern. Hence, one cannot make use of 
special coding algorithms that are applicable to uniform data. 



3 Compressed Storage of Sparse Automata 



As the transition matrix representation of FSAs is sparse, it is natural to ex- 
plore techniques for storing general sparse matrices. This section presents such 
a technique for representing an accepting automaton and weighted finite-state 
transducers. 
3.1 Data Structures and Storage Schemes

The simplest representation of an automaton is a transition matrix. Rows in the
matrix denote states, while columns denote the entire alphabet. An entry on
row q ∈ Q and column s ∈ Σ gives the set of next states that can be reached
from state q on the symbol s. (In the case of transducers, each entry is a set of
pairs (p,o), with p denoting the destination state and o the output symbol. In
the case of weighted devices, a weight w is added.)




(a) [transition diagram]

(b)

State  F  M  S  T  W  a  d  e  h  i  n  o  r  s  t  u  y
0      5  2  1  3  4
1                     6                             7
2                                       7
3                              8                    9
4                           10
...

Fig. 4. An automaton for the days of the week with its transition matrix. The final
state (20) is marked in bold



Consider an automaton that accepts the strings representing the days of the
week. Fig. 4(a) gives the transition diagram, with the corresponding transition
matrix in Fig. 4(b). The advantage of this representation is its random access:
once the automaton, in a particular state, is faced with an input symbol, it takes
one lookup operation into the table to determine the destination state. However,
when matrices become very sparse, especially with the sparsity figures mentioned
in section 2.3, such a representation becomes infeasible from a space
complexity point of view. The matrix in Fig. 4(b) requires |Q| × |Σ| storage space.
Its sparsity is 1 − 26/(21 × 122) ≈ 98.985% (122 is the total number of symbols in the
English text-analysis system).

3.2 Row-Indexed Storage 

Instead of a matrix, an accepting automaton is represented by three arrays. The
first array, A, stores the entries from the full matrix representation. For example,
the entries from Fig. 4(b) are stored in array A of Fig. 5 row-wise. The second
array, A_q, stores the indices in A where the entries for each state begin. For
example, the entries of state 1 begin in position 5 in A; hence, A_q[1] = 5 (we
use this notation to state that element 1 in array A_q is 5). In a similar fashion,
the entries of state 3 begin in position 8 in A; hence, A_q[3] = 8. Finally, for each
entry in A, the third array A_s stores the corresponding symbol from the header
of Fig. 4(b). For example, the first entry in A, viz. 5, is in the column marked F
in Fig. 4(b); hence, A_s[0] = F.



index k   0  1  2  3  4  5  6  7  8  9  10  ...
A         5  2  1  3  4  6  7  7  8  9  10  ...
A_s       F  M  S  T  W  a  u  o  h  u  e   ...

state q   0  1  2  3  4   ...
A_q       0  5  7  8  10  ...

Fig. 5. Row-indexed representation of a sparse automaton.



In terms of space complexity, the length of A equals the number of arcs in the
machine, |δ|; the length of A_q equals the number of states, |Q|; and the length
of A_s equals that of A. Hence, the space complexity of this representation is
2|δ| + |Q|, where |δ| is the number of non-zero entries in the original matrix and
|Q| is the number of states. Note that the number of symbols employed, |Σ|,
does not feature in this complexity.

As for time complexity, consider the following algorithm which returns the 
destination state given the current state q and the input symbol s. 

Destination(q, s)
1  k1 ← A_q[q]
2  k2 ← A_q[q + 1]
3  while k1 < k2 do
4      if A_s[k1] = s then
5          return A[k1]
6      end
7      k1 ← k1 + 1
8  end
9  return fail



Lines 1 and 2 assign to k1 and k2 the indices in A where the transitions for
state q begin and terminate, respectively. Lines 3-8 iterate over the transitions.
Line 4 checks if the current transition is on the symbol s. If so, line 5 returns the 
destination state. If the iteration fails to find a transition on s, the algorithm 
returns a fail (which can be interpreted as a “dead state”). 
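
The row-indexed scheme and the Destination algorithm can be sketched in Python as follows. This is an illustrative sketch, not the Bell Labs implementation: the function names are invented here, the arcs are the ones leaving states 0 to 4 of the days-of-the-week automaton, and one extra sentinel entry is appended to A_q (an assumption of this sketch) so that A_q[q + 1] is defined for the last state.

```python
def build_row_indexed(num_states, arcs):
    """Build the arrays A, A_q, A_s from (source, symbol, destination) arcs."""
    A, A_q, A_s = [], [], []
    ordered = sorted(arcs)          # by source state, then by symbol
    i = 0
    for q in range(num_states):
        A_q.append(len(A))          # where the entries of state q begin
        while i < len(ordered) and ordered[i][0] == q:
            _, symbol, dest = ordered[i]
            A_s.append(symbol)
            A.append(dest)
            i += 1
    A_q.append(len(A))              # sentinel: end of the last state's entries
    return A, A_q, A_s


def destination(A, A_q, A_s, q, s):
    """Linear scan over the arcs of state q; None plays the role of 'fail'."""
    k1, k2 = A_q[q], A_q[q + 1]
    while k1 < k2:
        if A_s[k1] == s:
            return A[k1]
        k1 += 1
    return None


# Arcs leaving states 0-4; states 5-10 have no arcs in this fragment.
ARCS = [(0, "F", 5), (0, "M", 2), (0, "S", 1), (0, "T", 3), (0, "W", 4),
        (1, "a", 6), (1, "u", 7), (2, "o", 7), (3, "h", 8), (3, "u", 9),
        (4, "e", 10)]
A, A_q, A_s = build_row_indexed(11, ARCS)
print(A_q[1], A_q[3], A_s[0], destination(A, A_q, A_s, 0, "T"))
```

The three arrays together hold 2|δ| + |Q| + 1 elements here, the extra one being the sentinel.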



Table 2. Arcs to state ratios for surface to lexical transducers.

            A             B
Language    Arcs/State    |Q| : |Σ|
German      3.5           132343 : 244
French      3.8           94257 : 288
Mandarin    4.9           45813 : 17027




Let a represent the time it takes to access an array with a known index; the
above algorithm requires (3 + k2 − k1)a if successful and (2 + k2 − k1)a
otherwise. The worst case complexity takes place when requesting the destination
on the last symbol of the alphabet from the state with the most arcs. However,
considering that the ratio of arcs per state is small in natural language
machines, especially lexical FSTs, k2 − k1 is always small. Table 2, Column A,
gives empirical values from our inventory of surface-lexical transducers.

If a situation arises where the ratio of arcs per state is high (see §4.2 below),
lines 3-8 in the above algorithm can be replaced with a more efficient search,
e.g., binary search if the contents of A_s between k1 and k2 are sorted.
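
A minimal sketch of the binary-search variant, assuming the symbols of each state are stored in sorted order and that A_q carries a trailing sentinel entry (both assumptions of this example). It uses the standard-library `bisect_left`, which accepts `lo` and `hi` bounds and therefore searches only the slice belonging to state q.

```python
from bisect import bisect_left

# Hypothetical two-state fragment in row-indexed form.
A = [5, 2, 1, 3, 4, 6, 7]
A_q = [0, 5, 7]                       # last entry is a sentinel
A_s = ["F", "M", "S", "T", "W", "a", "u"]


def destination_sorted(A, A_q, A_s, q, s):
    """Binary search for symbol s among the sorted arcs of state q."""
    k1, k2 = A_q[q], A_q[q + 1]
    k = bisect_left(A_s, s, k1, k2)   # leftmost position where s could sit
    if k < k2 and A_s[k] == s:
        return A[k]
    return None                       # 'fail'


print(destination_sorted(A, A_q, A_s, 0, "S"))
```

The lookup then costs on the order of log(k2 − k1) array accesses instead of k2 − k1.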

3.3 Row-Indexed Storage with Transpose 

It was noted above that the space complexity is 2|δ| + |Q|. In language and
speech transducers, it is usually the case that |Σ| ≪ |Q|, as shown in Table 2,
Column B. Hence, further compression can be achieved by employing row-indexed
storage on the transpose of an original matrix M: the index array then has one
entry per symbol rather than one per state, giving 2|δ| + |Σ|.

For machines where |Q| is large there is a serious drawback in time
complexity using this method in that k2 − k1 will be very large. In such a case, a
binary search algorithm needs to replace lines 3-8 of the algorithm as indicated
above. Despite this drawback, this method may prove useful for special-purpose
hardware devices where there is a real limitation in memory, but with a lot of
“horse power.”
3.4 Storing Weighted Finite-State Transducers

A weighted finite-state transducer is a traditional transducer except that both
arcs and final states are associated with costs (Mohri, Pereira, and Riley, 1998).

The same mechanism described above is used to compress transducers, albeit
with some modification to cater for the symbols on the second tape. Here,
transitions are represented with a 2-dimensional matrix, i.e., a 3-dimensional array.
The extra dimension is needed to store the symbols on the second tape of the
transducer, i.e., the output symbols. The 2-dimensional matrix consists of four
vectors: A, A_q and A_s are as above with an additional vector, A_o, of the same
size as A_s to store output symbols.

In the case of weighted transducers, the weights need to be represented as
well. An additional vector, A_w, is used to store weights on transitions. Its size is
equal to the size of A_s.
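
The parallel-vector layout for weighted transducers can be sketched as below. The names A_o (output symbols) and A_w (weights), the dictionary container, the sentinel in A_q and the toy arcs are all assumptions of this example; the empty string stands in for ε.

```python
# Hypothetical fragment: state 0 has one arc, state 1 has two.
fst = {
    "A":   [1, 3, 2],            # destination states
    "A_q": [0, 1, 3],            # start index per state, plus a sentinel
    "A_s": ["v", "e", "e"],      # input symbols
    "A_o": ["v", "", "e"],       # output symbols ("" stands for epsilon)
    "A_w": [0.0, 1.5, 0.0],      # arc weights
}


def arcs_from(fst, q):
    """Return (input, output, weight, destination) for every arc of state q."""
    k1, k2 = fst["A_q"][q], fst["A_q"][q + 1]
    return [(fst["A_s"][k], fst["A_o"][k], fst["A_w"][k], fst["A"][k])
            for k in range(k1, k2)]


print(arcs_from(fst, 1))
```

All four per-arc vectors share the same index k, so a single pair of bounds from A_q serves every vector.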



3.5 Storing Final States 

The representation of the machine needs to indicate which states are final. In the
original Bell Labs implementation, final states have non-zero costs (final states
with cost 0 are coded with a special constant NO_COST value). In the compressed
version, final states are indicated by a bit vector whose size is ⌊|Q| / 8⌋ + 1
octets. If a state i is final, then the ith bit in the array is set; otherwise, it is
clear.
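
A bit vector of this kind can be sketched in a few lines of Python. The helper names are invented for the example, and the size uses integer division of the number of states by 8, plus one octet.

```python
def make_final_bits(num_states, finals):
    """Pack the set of final states into a bit vector, one bit per state."""
    bits = bytearray(num_states // 8 + 1)
    for i in finals:
        bits[i // 8] |= 1 << (i % 8)
    return bits


def is_final(bits, i):
    """True if the ith bit is set, i.e. state i is final."""
    return bool(bits[i // 8] & (1 << (i % 8)))


# The days-of-the-week automaton has 21 states, of which state 20 is final.
final_bits = make_final_bits(21, {20})
print(len(final_bits), is_final(final_bits, 20), is_final(final_bits, 19))
```

For 21 states the vector occupies three octets.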



4 Further Compression Methods 

Although the above method for storing sparse matrices results in good
compression rates, further compression can be achieved by implementing a number of
complementary ideas.



4.1 Using Every Bit 

The underlying storage mechanism makes use of a data structure that stores an 
element using one, two, three or four bytes. The choice depends on the maximum 
value of a particular data type. The alphabet of our German FSTs is 244 sym- 
bols and can be coded using eight bits (one octet). The French system, on the 
other hand, has an alphabet of 288 symbols for which nine bits (two octets) are 
required. 

One can get additional savings by pairing the input and output symbols. 
While this will not help in the case of German, tremendous savings can be 
achieved in the case of French: 16 bits (two octets), instead of 18 (three octets), 
will be sufficient to code pairs of symbols. Similar techniques are used throughout 
the implementation which yield almost double the compression rate. 
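
The arithmetic behind these figures can be checked with a short sketch. The helper names are invented here, and the pairing step is shown as a plain lookup table of the distinct input:output pairs that actually occur, which is one possible reading of the pairing idea rather than a description of the actual implementation.

```python
def bits_needed(n):
    """Bits needed to give each of n distinct values its own code."""
    return max(1, (n - 1).bit_length())


def octets_needed(bits):
    """Whole octets needed to hold the given number of bits."""
    return (bits + 7) // 8


def pair_table(pairs):
    """Replace each (input, output) pair by an index into a table of distinct pairs."""
    table = sorted(set(pairs))
    index = {p: i for i, p in enumerate(table)}
    return table, [index[p] for p in pairs]


german_bits = bits_needed(244)          # one symbol of the German alphabet
french_bits = bits_needed(288)          # one symbol of the French alphabet
french_two = octets_needed(2 * french_bits)

# Hypothetical arcs: only the distinct pairs need codes.
table, codes = pair_table([("a", "b"), ("a", "b"), ("c", "")])
print(german_bits, french_bits, french_two, codes)
```

Under this reading, 16 bits per arc suffice whenever no more than 65,536 distinct symbol pairs occur in a machine.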
4.2 Lossless Removal of Transitions

While the ratio of arcs to states for lexical-based FSTs is quite low (see Table 2,
Column A), this is not the case in rule-based FSTs. Consider the [e]-deletion
rule, depicted in its FST representation in Fig. 2. There are |Σ| − 1 arcs from
state 0 to itself on all symbols of the alphabet apart from ‘v’. If the alphabet
size is that of the English alphabet (in addition to punctuation symbols), this
FST would contain over 350 arcs.

Tremendous savings can be achieved by defining an ‘other’ symbol, denoted 
here by o, which was used in (Kay and Kaplan, 1983) and (Koskenniemi, 1983) 
as well as many current implementations. Hence, the |Σ| − 1 transitions from
state 0 to itself will be replaced by one transition on o:o. Transitions such as
those from state 6 to 0 pose a problem; here, the transition is on “other, but
not g”. This can be catered for by adding a “dead state” to the machine with a
transition from state 6 to it on g:g. Then one can have a transition from state 
6 to 0 on o:o. Care must be taken with the new machine, e.g., not to apply 
standard minimisation algorithms to it as the dead states will then be removed 
changing the expressiveness of the machine. An additional 10% of space saving 
can be achieved using this lossless removal of transitions (only 10% because not 
all FSTs are rule-based). 
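
The lookup logic for an ‘other’ symbol with an explicit dead state can be sketched as follows. For brevity the sketch uses single symbols rather than input:output pairs, and the constant names and the two-state fragment are hypothetical.

```python
OTHER = "@other"   # hypothetical spelling of the 'other' symbol
DEAD = -1          # hypothetical identifier of the dead state

# State 0 loops on everything except 'v'; state 6 returns to 0 on
# "other, but not g", which is expressed by sending 'g' to the dead state.
trans = {
    0: {"v": 1, OTHER: 0},
    6: {"g": DEAD, OTHER: 0},
}


def step(trans, q, s):
    """Follow the explicit arc on s if there is one, else the 'other' arc."""
    row = trans.get(q, {})
    if s in row:
        return row[s]
    return row.get(OTHER, DEAD)


print(step(trans, 0, "x"), step(trans, 0, "v"), step(trans, 6, "g"))
```

Explicit arcs take priority over the ‘other’ arc, which is why the arc into the dead state must not be removed by minimisation.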

Even with the use of the ‘other’ symbol, further reduction in size can be
achieved by algorithmically removing arcs from machines. Consider a
text-to-phoneme system that employs a number of FSTs in cascade. The FSTs at the
beginning of the cascade are likely to deal with normalization issues. Their input,
as well as output, alphabets are probably the set of orthographic symbols. The FSTs
at the other end of the cascade probably deal with only phonetic symbols and
markers. However, rule compilers add backtracking arcs in all machines using
the entire alphabet of the system (including orthographic, phonetic and other
tagging symbols). Say the alphabet of the entire text-to-phoneme system makes
use of markers such as sing and pl; these will certainly be used only in intermediate
machines and can easily be removed from the FSTs at the front-end of the cascade.

The grammar writer can designate the set of real input symbols to the first
FST in the cascade, say Σ0. Performing the operation Id(Σ0) ∘ FST0 (where
Id is the identity operator and FST0 is the first transducer in the cascade)
will remove all unnecessary arcs in FST0. Iterating over the arcs of the result,
one computes the output alphabet of the new transducer, Σ1. Performing the
operation Id(Σ1) ∘ FST1 will reduce the size of the next transducer, etc.
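
The cascade pruning can be approximated by the following sketch. It only filters arcs by their input symbol, which mimics the effect of composing with an identity transducer on the arcs themselves; a real composition would also discard states that become unreachable. The arc layout (source, input, output, destination), the function names and the toy cascade are assumptions of this example.

```python
EPS = ""   # the empty string stands in for epsilon


def restrict_inputs(arcs, sigma):
    """Keep arcs whose input symbol is epsilon or belongs to sigma."""
    return [a for a in arcs if a[1] == EPS or a[1] in sigma]


def output_alphabet(arcs):
    """Collect the non-epsilon output symbols of the remaining arcs."""
    return {a[2] for a in arcs if a[2] != EPS}


def prune_cascade(cascade, sigma0):
    """Prune each FST with the output alphabet of its pruned predecessor."""
    sigma, pruned = set(sigma0), []
    for arcs in cascade:
        kept = restrict_inputs(arcs, sigma)
        pruned.append(kept)
        sigma = output_alphabet(kept)
    return pruned


fst0 = [(0, "a", "A", 0), (0, "b", "B", 0), (0, "sing", "sing", 0)]
fst1 = [(0, "A", "x", 0), (0, "B", "y", 0), (0, "sing", "sing", 0)]
pruned = prune_cascade([fst0, fst1], {"a"})
print([len(p) for p in pruned])
```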

4.3 Extracting Weights 

It was mentioned that weighted transducers associate a cost with each arc and
with each final state. As costs are represented by floating point numbers, this
requires huge storage for large machines. One way to minimize this is by “pushing”
the costs of all final states onto their incoming arcs. For example, say a final state
is associated with a cost of 1.0 and has two incoming arcs with costs 2.0 and 3.0,
respectively. The cost of the state can be added onto the costs of the incoming
arcs. The result is that the final state is freed from any cost value and the
costs of the incoming arcs become 3.0 and 4.0, respectively.

Additionally, it is usually the case that grammar writers use a limited number 
of unique weights in lexica and grammars. The entire German surface-lexical 
transducer employs 33 unique costs, and the French one employs a mere 12. 
These costs can be listed in an external vector and arcs will have indices into 
the vector. Instead of having a floating point value associated with each arc of 
the French FSTs, for example, 4 bits will suffice to index the 12 unique costs. 
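
Both ideas of this section, pushing final costs onto incoming arcs and indexing a small table of unique costs, can be sketched as below. The arc layout (source, input, output, cost, destination) and the function names are assumptions of this example. The pushing step as written assumes that final states have no outgoing arcs; otherwise paths passing through a final state would be charged its cost as well.

```python
def push_final_costs(arcs, final_cost):
    """Add the cost of each final state to its incoming arcs, then zero it."""
    pushed = [(p, i, o, c + final_cost.get(d, 0.0), d)
              for (p, i, o, c, d) in arcs]
    return pushed, {q: 0.0 for q in final_cost}


def index_costs(arcs):
    """Store unique costs once and replace each arc cost by a table index."""
    table = sorted({a[3] for a in arcs})
    position = {c: k for k, c in enumerate(table)}
    return table, [position[a[3]] for a in arcs]


# The example from the text: a final state with cost 1.0 and two
# incoming arcs with costs 2.0 and 3.0.
arcs = [(0, "a", "a", 2.0, 2), (1, "b", "b", 3.0, 2)]
pushed, finals = push_final_costs(arcs, {2: 1.0})
cost_table, cost_codes = index_costs(pushed)
print([a[3] for a in pushed], cost_table, cost_codes)
```

With 12 unique costs, `(12 - 1).bit_length()` gives the 4 bits mentioned above.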



4.4 Storing More Information than Necessary! 

Combining all of the above complementary methods saves a lot of space, but 
adds to the time complexity of accessing data. Our compression gives the user 
the option of storing in the machines information that is accessed heavily even 
though it can be computed at run time. For example, the composition algorithm 
(as well as others) asks of each state the number of transitions on ε as input and
as output. This information can be computed at compile time and stored in the
machine. Since the average number of ε-containing transitions per state is low (input ε
is 0.4 transitions per state on average), a few bits can be used for storing such
information with a high gain in performance. 

Caching mechanisms can also be used to keep the latest information in mem- 
ory. Some of the machines that we implemented with caching perform faster 
than the original uncompressed machines. 
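
The caching mechanism itself is not described here; as one possible illustration, the sketch below memoises destination lookups with the standard-library `functools.lru_cache`. The arrays, the cache size and the function name are hypothetical.

```python
from functools import lru_cache

# Hypothetical two-state fragment in row-indexed form (with sentinel).
A = (5, 2, 1, 3, 4, 6, 7)
A_q = (0, 5, 7)
A_s = ("F", "M", "S", "T", "W", "a", "u")


@lru_cache(maxsize=4096)
def cached_destination(q, s):
    """Destination lookup whose most recent results are kept in memory."""
    for k in range(A_q[q], A_q[q + 1]):
        if A_s[k] == s:
            return A[k]
    return None
```

Repeated requests for the same (state, symbol) pair are then answered from the cache without touching the compressed arrays; `cached_destination.cache_info()` reports hits and misses.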



5 Conclusion and Results 

The above complementary methods were applied on the language modeling FSTs
for multi-lingual TTS. The data from Table 1 is repeated in Table 3 with the
compression results. For comparison purposes, the results of compressing the
machines with Unix’s compress are included (note that the machines cannot be 
accessed in this format). 



Table 3. Storage requirement for letter-to-sound rules in the Bell Labs multilingual
TTS system

Language    Modules    Size (MB)    Unix compress (MB)    Our Compression (MB)
German      49         26.6         5.7                   5.3
French      52         30.0         6.0                   4.6
Mandarin    51         39.0         13.2                  12.9




To illustrate access time, three German FSTs were composed with each other.
Say that the composition on the original non-compressed machines takes 100%
time. Applying the composition on the compressed versions of the machines
(with 83% compression rate) requires 155% of the original time. If caching is
enabled, however, the composition runs faster, in 63% of the original time.

This paper outlined an eclectic method for compressing large natural
language FSTs. Various complementary methods were employed, yielding a high
compression rate with minimal impact on speed.

Other related work includes (Roche and Schabes, 1995), who mention using an
algorithm by (Tarjan and Yao, 1979) to represent a sparse matrix. An
anonymous reviewer kindly pointed out a work by (Liang, 1983) to which I have no
access.

Acknowledgments. This work benefited from many discussions with Richard
Sproat who contributed especially to the material in section 4. Thanks are
due to Martin Jansche and Victor Faubert for useful comments.
Abstract. Finite-state techniques are widely used in various areas of
Natural Language Processing (NLP). As Kaplan and Kay |1 have ar- 
gued, regular expressions are the appropriate level of abstraction for 
thinking about finite-state languages and finite-state relations. More 
complex finite-state operations (such as contexted replacement) are de- 
fined on the basis of basic operations (such as Kleene closure, comple- 
mentation, composition). 

In order to be able to experiment with such complex finite-state oper- 
ations the FSA Utilities (version 5) provides an extendible regular ex- 
pression compiler. The paper discusses the regular expression operations 
provided by the compiler, and the possibilities to create new regular ex- 
pression operators. The benefits of such an extendible regular expression 
compiler are illustrated with a number of examples taken from recent 
publications in the area of finite-state approaches to NLP. 



1 Introduction 

Finite-state techniques are widely used in various areas of Natural Language 
Processing (NLP). As Kaplan and Kay m have argued, regular expressions are 
the appropriate level of abstraction for thinking about finite-state languages and 
finite-state relations. More complex finite-state operations (such as contexted 
replacement) are defined on the basis of basic operations (such as Kleene closure, 
complementation, composition). 

For instance, context sensitive rewrite rules have been widely used in several 
areas of natural language processing, including syntax, phonology and speech 
processing. Johnson HH has shown that such rewrite rules are equivalent to 
finite state transducers under the assumption that they are not allowed to rewrite 
their own output. An algorithm for compilation into transducers was provided by 
Kaplan and Kay H2! Improvements and extensions to this algorithm have been 
provided by Karttunen IE] US] HI and Mohri & Sproat m- Such algorithms 
take as their input regular expressions for the strings to be replaced and the left 
and right contexts, and produce a finite-state transducer. In other words, such 
an algorithm provides a new regular expression operator. 

Many different variants of replacement operators have been proposed, de- 
pending on whether rewrite rules are interpreted left to right, right to left or in 
parallel; whether rewrite rules are required to use longest, shortest or all matches;
whether rules are obligatory or optional; whether contexts should match the in- 
put side or the output side of the transductions etc. For this reason, it is crucial 
to be able to experiment with each of the various proposals in a flexible way. 

Version 5 of the FSA Utilities pni is an extended, rewritten and redesigned 
version of the FSA Utilities toolbox previously presented at the first WIA m- 
The FSA Utilities toolbox has been developed as a platform for experimenting 
with finite-state approaches in natural language processing. For this reason, the 
FSA Utilities toolbox is implemented in SICStus Prolog (cf. also section El) . 

FSA5 provides a very flexible extendible regular expression compiler. Below, 
we present the basic regular expression operations provided by the compiler, 
and the possibilities to create new regular expression operators. We illustrate 
the extendible regular expression compiler with a number of examples taken from
recent publications in the area of finite-state approaches to NLP. 

2 Regular Expressions 

Table 1 gives an overview of the basic regular expression operators provided
by FSA5. Apart from the standard regular expression operators and extended 
regular expression operators for regular languages, the tool-box also provides 
regular expression operators for regular relations. For example, the expression 

{a:b,b:c,c:a}* (1) 

is the transducer which rewrites each a into a b, each b into a c, and each c into
an a. Consider furthermore a transducer which removes each b, but which leaves
each non-b in place:¹

{b:[],? -b}*                                                      (2)

An Expr denoting a regular language is automatically coerced, in a context in
which a transducer is expected, into identity(Expr). Here, ? -b is
automatically coerced into identity(? -b), because it is unioned with a transducer.
Composing the examples (1) and (2)

{a:b,b:c,c:a}* o {b:[],? -b}*                                     (3)

yields a transducer which removes each a, and transduces each b to a c, and
each c to an a. For instance, the input abcabcabc yields cacaca.

In FSA5, such a regular expression could be turned into a transducer using
the command:

% fsa -r '{a:b,b:c,c:a}* o {b:[],? -b}*' > ex1.fa                 (4)

¹ For technical reasons a space is required after each occurrence of the ? meta-symbol.

Table 1. Basic regular expression operators in FSA5.

[]                   empty string
[E1,E2,...En]        concatenation of E1, E2 ... En
{}                   empty language
{E1,E2,...En}        union of E1, E2 ... En
E*                   Kleene closure
E^                   optionality
~E                   complement
E1-E2                difference
$ E                  containment
E1 & E2              intersection
?                    any symbol
A:B                  pair
E1 x E2              cross-product
A o B                composition
domain(E)            domain of a transduction
range(E)             range of a transduction
identity(E)          identity transduction
inverse(E)           inverse transduction



In this case, the resulting automaton is written to the file ex1.fa in FSA5
format. There are options to produce automata in many different formats,
including formats for other finite-automata tool-boxes such as AT&T’s fsm program
[11] and various visualization formats (including dot, vcg, daVinci, LaTeX and
postscript). Other interesting formats are as a Prolog or C program
implementing the transduction.

FSA5 can also be used interactively. In that case a graphical user interface is 
provided from which regular expressions can be input. The resulting automata 
are then displayed on the screen, and the resulting automata can be tested with 
sample inputs. The availability of such a graphical user interface in combination 
with various visualization tools has enabled the use of FSA5 in teaching |2j. 
For more information on these and other possibilities refer to the FSA Home 
Page: http://www.let.rug.nl/~vannoord/Fsa/. The FSA Home Page includes
an on-line demo. 
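
The behaviour of the composed expression (3) on the sample input can be mimicked outside FSA5 with plain Python functions. This is only a sketch of the two transductions as deterministic string functions; real transducers denote relations, and none of the names below belong to FSA5.

```python
def rewrite(mapping):
    """Symbol-by-symbol rewriting, as in {a:b,b:c,c:a}*."""
    def apply(text):
        out = []
        for ch in text:
            if ch not in mapping:
                return None            # input outside the domain
            out.append(mapping[ch])
        return "".join(out)
    return apply


def delete_b(text):
    """Remove each b and leave every other symbol in place, as in {b:[],? -b}*."""
    return "".join(ch for ch in text if ch != "b")


def compose(first, second):
    """Feed the output of the first transduction to the second."""
    def apply(text):
        mid = first(text)
        return None if mid is None else second(mid)
    return apply


example3 = compose(rewrite({"a": "b", "b": "c", "c": "a"}), delete_b)
print(example3("abcabcabc"))
```

The intermediate string is bcabcabca, from which every b is then removed.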

3 Extendible Regular Expression Operators 

The regular expression compiler can be extended with new regular expression 
operators by providing one or more files defining these operators. The definitions 
are essentially of two types. In both cases, the actual definitions are written in 
(often very simple) Prolog. On the one hand, operators can be defined in terms 
of existing regular expression operators. On the other hand, regular expression 
operators can be defined by providing a direct implementation on the underlying 
automata. Many researchers prefer the first style. For instance, Kaplan & Kay 
|C3 (P- 376) argue: 
The common data structures that our programs manipulate are clearly
states, transitions, labels, and label pairs — the building blocks of finite 
automata and transducers. But many of our initial mistakes and fail- 
ures arose from attempting also to think in terms of these objects. The 
automata required to implement even the simplest examples are large 
and involve considerable subtlety for their construction. To view them 
from the perspective of states and transitions is much like predicting 
weather patterns by studying the movements of atoms and molecules or 
inverting a matrix with a Turing machine. The only hope of success in 
this domain lies in developing an appropriate set of high-level algebraic 
operators for reasoning about languages and relations and for justifying 
a corresponding set of operators and automata for computation. 

Paradoxically, Mohri & Sproat improve upon Kaplan & Kay’s algorithm by 
taking precisely the opposite approach. Their algorithm is primarily presented 
in terms of manipulations upon states and transitions within automata. One 
could perhaps translate Mohri & Sproat ’s algorithm into a high-level calculus, 
but a great deal of efficiency would be lost in the process. It is a testimony to 
the flexibility of FSA5, that these two approaches can both be implemented and 
combined (cf. section ^31 • 

New operators in terms of existing operators. A regular expression operator is 
defined as a pair macro(ExprA,ExprB) which indicates that the regular expres- 
sion ExprA is to be interpreted as regular expression ExprB. For example, simple 
nullary regular expression operators (equivalent to abbreviatory devices found 
in tools such as lex and flex), can be defined as in the following example: 

macro(vowel, {a,e,i,o,u}).                                        (5)

indicating that the operator vowel/0 can be understood by assuming that every
occurrence of vowel in a regular expression is textually replaced by {a,e,i,o,u}.

The same mechanism is used to define n-ary operators, exploiting Prolog 
variables. For instance, the containment operator containment (Expr) is the set 
of all strings which have as a sub-string any of the strings in Expr. This could 
be defined as follows:

macro(containment(Expr), [? *,Expr,? *]).                     (6)

Naturally, operators defined in this way can be part of the definition of other 
operators. For instance, the operator free (A) is the language of all strings which 
do not have any of the strings in A as a substring. This can be defined as: 

macro(free(A), ~containment(A)).                              (7)
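As an illustrative aside (Python, not part of FSA5), the meaning of containment and free can be checked on a finite stand-in for the universal language ? *. The helper names and the length bound are assumptions made for this sketch only.

```python
from itertools import product

def all_strings(alphabet, max_len):
    """Every string over `alphabet` of length 0..max_len (finite stand-in for ? *)."""
    return {"".join(p) for n in range(max_len + 1) for p in product(alphabet, repeat=n)}

def containment(expr, universe):
    """Strings of `universe` that have some member of `expr` as a substring."""
    return {w for w in universe if any(x in w for x in expr)}

def free(expr, universe):
    """Complement of containment(expr), taken relative to `universe`."""
    return universe - containment(expr, universe)

if __name__ == "__main__":
    universe = all_strings("ab", 3)
    print(sorted(free({"ab"}, universe)))
```

The complement is taken relative to the bounded universe, whereas the calculus complements with respect to all strings.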

Note that the containment operator is standardly available in FSA5. Many of the
built-in operators in FSA5 are defined using the same technique.

We have found it useful to define boolean operators using this mechanism. In
fact, if we use the universal language to stand for true and the empty language to
stand for false, then the standard operators for intersection and union correspond
to conjunction and disjunction:

macro(true, ? *).                                             (8)

macro(false, {}).

With these definitions we get the expected properties:

    true & true   = true         {true, true}   = true
    true & false  = false        {true, false}  = true
    false & true  = false        {false, true}  = true
    false & false = false        {false, false} = false        (9)

The macros for true and false can also be used to define a conditional
expression in the calculus. The operator coerce_to_boolean maps the empty language
to the empty language, and any non-empty language to the universal language:

macro(coerce_to_boolean(E),                                   (10)
      range(E o (true x true))).

macro(if(Cond, Then, Else),
      { coerce_to_boolean(Cond) o Then,
        ~coerce_to_boolean(Cond) o Else }).
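The boolean encoding can be tried out on finite sets. The following Python sketch is an illustration only: it models languages as sets of strings over a bounded universe and treats composition with a language as intersection, which is an assumption that holds for recognizers but not for arbitrary transducers.

```python
from itertools import product

ALPHABET = "ab"
MAX_LEN = 2
# Finite stand-in for the universal language ? *.
UNIVERSE = frozenset("".join(p) for n in range(MAX_LEN + 1)
                     for p in product(ALPHABET, repeat=n))
TRUE, FALSE = UNIVERSE, frozenset()

def coerce_to_boolean(lang):
    """Empty language -> FALSE, any non-empty language -> TRUE."""
    return TRUE if lang else FALSE

def complement(lang):
    """Complement relative to the bounded universe."""
    return UNIVERSE - lang

def if_(cond, then, else_):
    """Union of Then guarded by the condition and Else guarded by its complement."""
    guard = coerce_to_boolean(cond)
    return (guard & then) | (complement(guard) & else_)
```

Because the guard is either the whole universe or empty, exactly one branch of the union survives.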

Various interesting properties of automata have been implemented which yield
boolean values, such as the predicates is_equivalent/2 for recognizers, and
is_functional/1 and is_subsequential/1 for transducers (using the algorithms
described in, for instance, I2SI).



Regular expression operator definitions can also be recursive. The follow- 
ing example demonstrates furthermore that definitions can take the operands 
of the operator into account. The operator set (List) yields the union of the 
languages given by each of the expressions in the list List; union (A, B) is a 
built-in operator providing the union of the two languages A and B: 

macro(set([]), '{}').                                         (11)

macro(set([H|T]), union(H,set(T))).
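The recursive expansion performed by these two clauses can be mirrored in a few lines of Python. This is a sketch of the rewriting only; the tuple representation of terms and the function names are assumptions, not FSA5 data structures.

```python
def expand_set(exprs):
    """Mirror of (11): set([]) -> '{}', set([H|T]) -> union(H, set(T))."""
    if not exprs:
        return "{}"
    head, *tail = exprs
    return ("union", head, expand_set(tail))

def show(term):
    """Render a nested ('union', A, B) tuple in operator(arg,arg) notation."""
    if isinstance(term, tuple):
        op, *args = term
        return f"{op}({','.join(show(a) for a in args)})"
    return str(term)

if __name__ == "__main__":
    print(show(expand_set(["a", "b", "c"])))
```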

We can also exploit the fact that these definitions are directly interpreted
in Prolog by providing Prolog constraints on such rules. This possibility is used
in [7] to define a longest-match concatenation operator which implements the
leftmost-longest capture semantics required by the POSIX standard (cf. section 4.4).

A simple example is a generalization of the operator free. Suppose we want 
to define an operator free(N,Expr) indicating the set of strings which do not 
contain more than N occurrences of Expr. This can be done as follows: 






macro(free(N,X), ~[? *|List]) :-                              (12)
    free_list(N,X,List).

free_list(0,X,[X,? *]).
free_list(N0,X,[X,? *|T]) :-
    N0 > 0, N is N0-1, free_list(N,X,T).
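The list built by free_list contains N+1 copies of X separated by ? *, so the complement rejects exactly the strings with more than N (non-overlapping) occurrences. A Python analogue using the standard re module, given here only as a sketch with hypothetical function names and with X restricted to a literal string:

```python
import re

def free_list(n, x):
    """Regex with n+1 occurrences of the literal x, with arbitrary material in between."""
    return ".*" + ".*".join([re.escape(x)] * (n + 1)) + ".*"

def in_free(n, x, word):
    """True if `word` does not contain more than n occurrences of x."""
    return re.fullmatch(free_list(n, x), word, flags=re.DOTALL) is None
```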

Another example is an implementation of the N-queens problem: how to place 
N queens on an N by N chess-board in such a way that no queen attacks any 
other queen. For any N we can create a regular expression generating exactly all 
strings of solutions. A solution to the N-queen problem is represented as a string 
of N integers between 1 and N. An integer i at position j in this string indicates 
that a queen is placed on the i-th column of the j-th row. 

macro(n_queens(N), sigma(N)*                                  (13)
                 & length(N)
                 & columns(N)
                 & diagonals(N)
                 & reverse(diagonals(N))).

The operator n_queens(N) is defined as the intersection of a number of con- 
straints. The first constraint, sigma (N)*, indicates that a solution must be a 
string of integers between 1 and N. The second constraint indicates that the 
length of the string must be N. The remaining constraints ensure that queens 
do not attack each other. The definition of length illustrates once more the use 
of Prolog to create a regular expression; the definition of sigma/1 uses the set 
operator defined previously. 

macro(length(N), List) :- length(List,N), fill_qm(List).      (14)

fill_qm([]).
fill_qm([?|T]) :- fill_qm(T).

macro(sigma(N), set(L)) :-
    findall(C, between(1,N,C), L).

between(N,_,N).
between(N0,N,I) :-
    N1 is N0+1, N1 < N+1, between(N1,N,I).

The complete program is given in the appendix. For instance, the expression
n_queens(5) produces the automaton in figure 1.
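The language accepted by such an automaton can be cross-checked by brute force. The Python sketch below is an independent enumeration, not the FSA5 construction: it filters all strings of length N over 1..N with predicates that correspond to the columns, diagonals and reversed-diagonals constraints as they are defined in the appendix.

```python
from itertools import product

def n_queens(n):
    """All length-n tuples over 1..n that pass the constraints of macro (13)."""
    def columns_ok(s):
        # free([i, ? *, i]): no column value occurs twice
        return len(set(s)) == len(s)

    def diagonals_ok(s):
        # free([i, length(k-1), i+k]): value i is never followed k rows later by i+k
        return all(s[j + k] != s[j] + k
                   for j in range(len(s)) for k in range(1, len(s) - j))

    return [s for s in product(range(1, n + 1), repeat=n)   # sigma(N)* & length(N)
            if columns_ok(s) and diagonals_ok(s) and diagonals_ok(s[::-1])]

if __name__ == "__main__":
    for solution in n_queens(5):
        print(solution)
```

Applying the diagonal test to the reversed string plays the role of reverse(diagonals(N)) and covers the second diagonal direction.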

The mechanism described so far to define new regular expression operators is
already quite powerful. As another illustration, consider the problem of compiling 
a given finite automaton into a regular expression. This problem becomes trivial 
if we allow the introduction of new operators. Here is the definition of an operator 
‘fa/5’ which describes an automaton as a listing of its components: 
macro(fa(Sigma,States,Initials,Finals,Transitions),           (15)
      range( % state-sym-state triples:
             ( [[States,Sigma]*,States]
             & % no non-transition triples:
               free([States,Sigma,States]-Transitions)
             & % start in start-state:
               [Initials,? *]
             & % end in final state:
               [? *,Finals]
             ) o % get rid of state names:
               [[States x [],?]*, States x []]
           )).




Fig. 1. Solution to the 5-queens problem 



As an example, the automaton given in figure 2.16 of jl l)j (reproduced in figure 2)
can be specified as:

fa({0,1}, {q1,q2,q3}, {q1}, {q2,q3},                          (16)
   {[q1,0,q2],[q1,1,q3],[q2,0,q1],
    [q2,1,q3],[q3,0,q2],[q3,1,q2]})
Fig. 2. Example automaton from figure 2.16 of m-
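The idea behind fa/5 is that a word is accepted if it can be interleaved with state names so that only allowed state-symbol-state triples occur, the first state is initial and the last one is final. The following Python sketch checks this condition directly for the automaton in (16); it enumerates candidate state sequences and is meant as an illustration of the encoding, not as an efficient recognizer.

```python
from itertools import product

SIGMA = {"0", "1"}
STATES = ["q1", "q2", "q3"]
INITIALS = {"q1"}
FINALS = {"q2", "q3"}
TRANSITIONS = {("q1", "0", "q2"), ("q1", "1", "q3"), ("q2", "0", "q1"),
               ("q2", "1", "q3"), ("q3", "0", "q2"), ("q3", "1", "q2")}

def accepts(word):
    """True if some state sequence threads `word` through allowed triples only."""
    if any(sym not in SIGMA for sym in word):
        return False
    for path in product(STATES, repeat=len(word) + 1):
        triples = zip(path, word, path[1:])
        if (path[0] in INITIALS and path[-1] in FINALS
                and all(t in TRANSITIONS for t in triples)):
            return True
    return False
```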



Direct implementation of new operators. Some operators are more easily defined 
in terms of the underlying automaton. For instance, the operator reverse(X)
is the set of all strings Y such that the reversal of Y is in X. If the operand
X is constructed by means of standard regular expression operators, it would be
possible to define the reverse operator recursively in terms of the various forms
that X can take; in FSA, however, X could be constructed by means of various
user-defined operators as well, so this approach is not applicable. However,
the operation is trivial to define in terms of the underlying automaton: the
direction of each transition is reversed, and final states become start states
and vice versa. The (simplified) definition is given as follows:

rx(reverse(Expr),Fa) :-                                       (17)
    fsa_regex:rx(Expr,Fa0), reverse_fa(Fa0,Fa).

reverse_fa(Fa0,Fa) :-
    fsa_data:start_states(Fa0,Finals),
    fsa_data:final_states(Fa0,Starts),
    fsa_data:transitions(Fa0,Trans0),
    reverse_trans(Trans0,Trans),
    fsa_data:construct_fa(Starts,Finals,Trans,Fa).

reverse_trans([],[]).
reverse_trans([trans(A,B,C)|T0],[trans(C,B,A)|T]) :-
    reverse_trans(T0,T).

As is typical in such definitions, the fsa_regex:rx predicate is used to construct 
an automaton for a given regular expression. The fsa_data module provides a 
consistent interface to the internal representation of automata. Its predicates 
can be used to select relevant parts of an automaton (such as start states, final 
states and transitions) and to construct automata on the basis of such parts. 
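The same construction can be sketched outside Prolog. In the Python fragment below an automaton is a triple (starts, finals, transitions); this representation and the function names are assumptions for illustration and do not reflect the fsa_data interface.

```python
def reverse_fa(fa):
    """Swap start and final states and flip every (source, symbol, target) transition."""
    starts, finals, transitions = fa
    flipped = {(target, sym, source) for (source, sym, target) in transitions}
    return (set(finals), set(starts), flipped)

def accepts(fa, word):
    """Simulate a possibly nondeterministic automaton on `word`."""
    starts, finals, transitions = fa
    current = set(starts)
    for sym in word:
        current = {t for (s, a, t) in transitions if s in current and a == sym}
    return bool(current & set(finals))

# Automaton for the language a b*: 0 -a-> 1, 1 -b-> 1.
AB_STAR = ({0}, {1}, {(0, "a", 1), (1, "b", 1)})
```

The reversed automaton is in general nondeterministic, which is why the simulation tracks a set of current states.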

4 Regular Expression Operators in NLP 

This section illustrates the flexibility and the power of the FSA5 extendible 
regular expression compiler on the basis of a number of examples taken from 
recent publications in the field of NLP. 
4.1 Lenient Composition

In a recent paper, Karttunen m has provided a new formalization of Optimal- 
ity Theory in terms of regular expressions. Optimality theory m is a frame- 
work for the description of phonological regularities which abandons rewrite 
rules. Instead, a universal function called GEN is proposed which maps input 
strings non-deterministically to many different output strings. In addition, a set 
of ranked universal constraints rule out many of the phonological representations 
generated by GEN. Some constraints can be conflicting. Therefore it might be 
impossible for a candidate string to satisfy all constraints. A string is allowed 
to violate a constraint as long as there is no other string which does not violate 
that constraint. 

Procedurally, this mechanism can be understood as follows. Firstly, an in- 
put is mapped to a set of candidate output strings. This set of strings is then 
passed on to the most important constraint. This constraint removes many of 
the candidate strings. The remaining strings are passed on to the next most
important constraint, and so on. If the application of a constraint would remove
all remaining candidate strings, then no strings are removed (constraints are
violable). In the simplest case, only a single string survives all of the
constraints. If none of the strings satisfies a given constraint, then the strings
with the fewest violations of that constraint survive.

Karttunen formalizes GEN as a regular relation. Each of the constraints is it- 
self a regular language allowing only the strings which satisfy the constraint (un- 
less no strings satisfy the constraint). If the constraints were to be combined using 
ordinary composition, then the set of outputs would often be empty. Therefore, 
instead of composition Karttunen introduces an operation of lenient_composition 
which is closely related to a notion of defaults. 

Informally, the lenient_composition of S and C is the composition of S and C,
except for those elements in the domain of S that are not mapped to anything
by S o C. Thus, it enforces the constraint C on those strings in S which have an
output that satisfies the constraint:

macro(priority_union(Q,R), {Q, ~domain(Q) o R}).              (18)

macro(lenient_composition(S,C), priority_union(S o C, S)).

Here, priority_union of two transductions Q and R is defined as the union of 
Q and the composition of the complement of the domain of Q with R; i.e. we 
obtain all pairs from Q, and moreover for all elements not in the domain of Q 
we apply R. Lenient composition of S and C is defined as the priority union of 
the composition of S and C (on the one hand) and S (on the other hand); i.e. 
we obtain the composition of S and C and moreover for all inputs for which that 
composition is empty we retain S. 

Consider the example

lenient_composition({b x [b,b], a x [b,b]*}, [b,b,b]*)        (19)
The input transducer maps an a to an even number of b’s, and it maps a b to
two b’s. If this transducer is leniently composed with the requirement that the
result must be a string of b’s whose length is divisible by 3, then the resulting
transducer maps b to two b’s, as before (since the constraint cannot be satisfied
for any output of the input b), and it maps an a to a string of b’s whose length is
divisible by 6.
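Example (19) can be replayed on finite relations. In this Python sketch a transduction is a set of (input, output) pairs and the constraint is a predicate on outputs; the infinite relation a x [b,b]* is cut off at 12 symbols, which is an assumption needed to keep the sets finite.

```python
def domain(rel):
    """Inputs for which the relation has at least one output."""
    return {x for (x, _) in rel}

def compose_with_language(rel, accept):
    """Relation o language: keep the pairs whose output satisfies the constraint."""
    return {(x, y) for (x, y) in rel if accept(y)}

def priority_union(q, r):
    """All of q, plus the pairs of r whose input is outside the domain of q."""
    return q | {(x, y) for (x, y) in r if x not in domain(q)}

def lenient_composition(s, accept):
    """Priority union of (s o constraint) and s, as in (18)."""
    return priority_union(compose_with_language(s, accept), s)

def divisible_by_3(y):
    """The constraint [b,b,b]*: only b's, length a multiple of three."""
    return set(y) <= {"b"} and len(y) % 3 == 0

# Example (19), with a x [b,b]* truncated to outputs of length <= 12.
S = {("b", "bb")} | {("a", "bb" * k) for k in range(7)}
RESULT = lenient_composition(S, divisible_by_3)
```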

Karttunen illustrates the method by providing a formalization of the syllabi- 
fication analysis in Optimality Theory. This formalization has been implemented 
in FSA5 and is given in the appendix. 



4.2 Priority Union for Lexical Analysis 

Another application of the priority union operator is in spell checking. As in P] 
we consider a finite-automaton approach. Suppose we are given a dictionary in 
the form of a transducer. The transducer will map each word to its lexicographic 
description. A spell checker attempts to find, for a given word, the lexicographic 
description of the word which is closest to a word in the dictionary according 
to some distance function. As in many spell checkers we assume Levenshtein 
distance: the minimum number of substitutions, deletions and insertions that
is required to map one string into another. In FSA all strings with a Levenshtein
distance of 1 can be defined as follows; here X can be thought of as the dictionary,
and lev1(X) is the Levenshtein-1 closure of the dictionary:

macro(lev1(X), {subs(X), del(X), ins(X)}).                    (20)

The operators subs/1, del/1 and ins/1 are built-in. The expression subs(X)
stands for all pairs (x,y) such that (x',y) is in the relation defined by X and
x' can be formed from x by a single substitution. The insertion and deletion
operators are defined likewise.

In contrast to Pj we want to obtain the candidates with minimal distance. For
instance, if we attempt to look up book then we don’t want to get the description
of cook as a result. This can be defined using the priority union operator as
follows:

macro(spell1(X), priority_union(X, lev1(X))).                 (21)

For instance, applying spell1 to a dictionary consisting of the identity
transducer over the words book, look, lock, oak would map each of these words to
itself, and in addition it would map a form such as wook to the set book, look
and a form such as ook to the set book, look, oak.

We can define expressions for any given radius a. For example, the case which 
treats a = 2 is given by: 



macro(spell2(X), priority_union(spell1(X), lev1(lev1(X)))).   (22)
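The effect of spell1 and spell2 on a word list can be imitated without transducers. The Python sketch below applies the edit operations to the query word instead of to the dictionary (Levenshtein distance is symmetric) and returns only the entries at the smallest distance found, which is the behaviour that priority union provides. The alphabet and the function names are assumptions of the sketch.

```python
ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def edits1(word):
    """All strings at Levenshtein distance at most 1 from `word`."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {a + b[1:] for a, b in splits if b}
    subs = {a + c + b[1:] for a, b in splits if b for c in ALPHABET}
    inserts = {a + c + b for a, b in splits for c in ALPHABET}
    return deletes | subs | inserts

def spell(word, dictionary, radius=1):
    """Dictionary entries at the smallest distance <= radius from `word`."""
    frontier = {word}
    for _ in range(radius + 1):
        hits = frontier & dictionary
        if hits:
            return hits          # closer candidates take priority
        frontier = {e for w in frontier for e in edits1(w)}
    return set()

DICTIONARY = {"book", "look", "lock", "oak"}
```

With radius=1 this corresponds to spell1, and with radius=2 to spell2.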
4.3 The Replace Operator

In IlHI a variant of the replace operator is implemented which is more efficient 
than previous implementations provided by Kaplan and Kay H2I and Karttunen 
P3- This improved version crucially depends on the possibility of manipulating 
the transitions and states of the underlying automata directly. The replacement
of expression Phi by Psi in the context of Left and Right is written
replace(Left,Phi,Psi,Right). In the left-to-right interpretation, this operator
can be defined as the following cascade:

macro(replace(L,Phi,Psi,R),                                   (23)
      r(R) o f(Phi) o replace(Phi,Psi) o l1(L) o l2(L)).

This definition and the definitions of the auxiliary operators are closely modelled 
on those given in m- The auxiliary operators are defined in the appendix. 

A typical example of the use of the replace operator is provided by the past 
tense endings of Dutch regular verbs. In Dutch, the singular past tense is formed 
by the -de and -te suffixes. If the previous phoneme is voiced, the suffix -de must 
be used; in other circumstances the -te suffix is appropriate. This phenomenon
can be analysed by assuming an underlying, abstract, -Te suffix. The T is then 
transformed into a d or t depending on context. The rule can be defined as 
follows (the + indicates a morpheme boundary) : 

macro(tkofschip,                                              (24)
      replace([{k,f,s,[c,h],p,t,x},+], 'T', t, e)
        o
      replace(+, 'T', d, [])).
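For comparison, the two-step cascade can be approximated with ordinary regular expression substitution in Python. This is only an approximation of the transducer: the re module rewrites a single string left to right and does not build a relation, and the example verb stems are chosen for illustration.

```python
import re

# Left context of the first rule: a voiceless consonant followed by '+'.
VOICELESS = r"(?:ch|k|f|s|p|t|x)"

def tkofschip(form):
    """Rewrite abstract T as t after a voiceless consonant (before e), otherwise as d."""
    form = re.sub(rf"({VOICELESS}\+)T(?=e)", r"\1t", form)
    return re.sub(r"(\+)T", r"\1d", form)

if __name__ == "__main__":
    for stem in ("werk+Te", "hoor+Te", "lach+Te"):
        print(stem, "->", tkofschip(stem))
```

As in (24), the second substitution only sees the T symbols that the first one left in place.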

4.4 Leftmost-Longest Contexted Replacement 

A leftmost-longest match contexted replacement operator

lml(T,Left,Right)

is defined in [7]; it ensures that the transducer T is applied in contexts Left and
Right, using a leftmost-longest match strategy. One application of such an op-
erator is finite-state parsing (chunking) [1,5,8,22]. In finite-state parsing, sets of
context-free rules are collected into levels. Typically there is a finite number of
such levels, and these levels are ordered. First each of the rules of the first level
applies. The result is then input to the second level, etc. Note that rules cannot
work on their own output, unless the same rule is placed in several levels.

In the following example we will not use the contexts; therefore lml/1 is
defined as:

macro(lml(T), lml(T,[],[])).                                  (25)

This operator ensures that the transducer T is applied to a string at all possible
positions, using a left-to-right leftmost-longest match policy.
In this particular example we will assume that the input to the finite-state
parser is a tagged sentence: each word is represented by a category, an opening
bracket, the word itself, and a closing bracket. A rule with a given left hand
side and right hand side will look for the sequence of elements described by the
right hand side and wrap the result inside left hand side brackets. In general,
the macro --> can be defined as the following transducer (this macro disallows
the case that the sequence of daughters is the empty string):

macro((A --> Ds), [[] x [A,'['], Ds-[], []:']']).              (26)

We use the macro d(Expr) for elements in the right hand side of rules; the 
macro dw(Expr) is similar but is used for pre-terminals, to refer to specific words. 

macro(d(Cat) , [Cat, ,free . (27) 

macro (dw (Cat, Word) , [Cat, ’[’:’(’, Word, 

Note that the brackets (introduced by an earlier level) are replaced here by
other brackets in order to ensure that these brackets cannot be used in later
levels again; in other words, at any given level we can only ‘see’ the top-most
constituents (yet the full parse tree can be recovered using the ‘invisible’ brackets).

Using these two macros, a rule to recognize basic noun phrases is:

np --> [d(art)^, d(num)^, d(adj)*, d(n)+]                     (28)

A level of rules can now simply be defined as the replacement operator applied to 
the union of these rules. For instance, the following is a level recognizing multi- 
word-units (for instance, the Dutch phrase ‘ten opzichte van’ is comparable to 
the English phrase ‘with respect to’): 

macro(mwu, lml({                                              (29)
    (p --> [dw(p,ten),dw(n,opzichte),dw(p,van)]),
    (p --> [dw(p,in),dw(n,verband),dw(p,met)]),
    (p --> [dw(p,in),dw(n,plaats),dw(p,van)]) })).

Finally, we use composition to combine a number of such levels. Thus, the 
following expression defines a simple noun-phrase chunker: 

macro(np_chunker,                                             (30)
      mwu o lml((adj --> [d(adv),d(adj)]))
          o lml((np --> [d(art)^,d(num)^,d(adj)*,d(n)+]))
          o lml((pp --> [d(p),d(np)]))
          o lml((np --> [d(np),d(pp)+]))).

For example, one of the sentences from the Eindhoven corpus is chunked as
in figure 3.
[Figure: parse tree produced by the chunker. The leaves carry tags such as art, n,
adj, p and v, grouped under np and pp nodes, over the words ‘de kleur’, ‘blijkt
toch wel’, ‘een onmisbaar bestanddeel van’, ‘zo’n uitzending’ and ‘te zijn’.]

Fig. 3. Application of NP chunker
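The level-by-level organisation of such a chunker can be illustrated with a small Python sketch. It works on a list of (category, content) nodes rather than on bracketed strings, and it relies on greedy matching of the re module, which yields the longest match for the simple rule shapes used here but is not a general leftmost-longest implementation. All names in the sketch are hypothetical.

```python
import re

def d(cat):
    """Pattern for one node of category `cat` in a space-terminated tag string."""
    return rf"(?:(?<!\S){re.escape(cat)} )"

def apply_level(nodes, lhs, pattern):
    """Wrap every leftmost (greedy) match of `pattern` over the tags in an `lhs` node."""
    offsets, pos = {}, 0
    for i, (cat, _) in enumerate(nodes):
        offsets[pos] = i                  # character offset -> node index
        pos += len(cat) + 1
    offsets[pos] = len(nodes)
    tags = "".join(cat + " " for cat, _ in nodes)
    out, done = [], 0
    for m in re.finditer(pattern, tags):
        if m.end() == m.start():
            continue                      # a rule never wraps the empty string
        i, j = offsets[m.start()], offsets[m.end()]
        out += nodes[done:i] + [(lhs, nodes[i:j])]
        done = j
    return out + nodes[done:]

LEVELS = [
    ("np", d("art") + "?" + d("num") + "?" + d("adj") + "*" + d("n") + "+"),
    ("pp", d("p") + d("np")),
    ("np", d("np") + d("pp") + "+"),
]

def chunk(nodes):
    """Apply the levels in order; each level sees only the top-most nodes."""
    for lhs, pattern in LEVELS:
        nodes = apply_level(nodes, lhs, pattern)
    return nodes
```

Each level operates on the output of the previous one, so a rule can only apply to its own output if it is listed again at a later level.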



5 Implementational Issues 

The regular expression compiler is defined in SICStus Prolog. This choice was 
motivated because the FSA Utilities toolbox has been developed as a platform 
for experimenting with finite-state approaches in natural language processing. 
Prolog allows for rapid prototyping of new techniques and variations of known 
techniques. The drawback is that the CPU-time requirements increase in
comparison with an implementation based on C or C++. In PE! it is shown
that the implementation of the determinizer is typically about 2 to 5 times
slower in FSA Utilities than in AT&T’s fsm library (PH]); the FSA Utilities
toolbox contains a variant of the determinization algorithm for input automata
with large numbers of ε-moves. In such cases FSA Utilities is often much faster.
The implementation of the minimization algorithm m is up to three times 
faster than the implementation described in which was shown to be much 
faster than the corresponding implementations in Fire Lite EZI and Grail m- 

Regular expressions are read and parsed using the Prolog parser (i.e. regular 
expressions are read in as Prolog terms), exploiting the inherent flexibility of 
this parser (such as the possibility to declare that new operators may be written 
using operator syntax; therefore we can write E1 o E2 instead of o(E1,E2)).
The constructed term is straightforwardly compiled into a corresponding finite 
automaton using a simple top-down recursive-descent procedure. 

This mechanism implies that in order to construct an automaton for a regular
expression such as [a*,b,c^,d+], automata are constructed for each of the
sub-expressions. For regular expressions which are constructed solely using such
simple operators, more efficient automaton construction algorithms are known. 
We have not implemented these algorithms because of the desire to be able to 
treat user-defined operators. A possible improvement could be to have the com- 
piler identify which parts of an expression are simple enough to be treated by a 
more efficient specialized algorithm. 

The compiler supports caching of sub-expressions. If the cache facility is 
switched on, then the result of each sub-expression that is encountered will be 
cached for later re-use. This can increase efficiency for the compilation of a single 
expression, but it is especially useful in an interactive session where the user 
gradually alters the regular expression; typically a large part of the expression 
remains the same and interactive response time can be much more attractive. 

The caching facility can also be used selectively. The cache(Expr) operator
can be used to cache the result of the compilation of a specific regular
expression Expr. For instance, in example (20) the expression lev1(X) is defined as
{subs(X), del(X), ins(X)}. If we write instead:

macro(lev1(X), {subs(cache(X)), del(cache(X)), ins(cache(X))}).   (31)

then X will be compiled only once.

The compiler supports a number of other operators which have an effect on 
the underlying automata, but not on the corresponding language or relation. 
For instance, the operator determinize(Expr) can be used to ensure that the 
resulting automaton is determinized. Similar operators provide a simple interface 
to various minimization algorithms provided by FSA5. 

Furthermore, certain operators can be used for the sole purpose of obtain- 
ing a side-effect. One example was the cache/1 operator discussed above. The 
operator spy(Expr), for instance, can be used to request that the compiler
provides progress information on the compilation of the expression Expr (size of the
result, and CPU-time required to obtain the result). Such progress information 
is crucial for a better understanding of the sources of complexity of particular 
expressions. 

Concluding Remarks 

We have presented the extendible regular expression compiler of FSA5. We
have shown that the functionality and flexibility provided by the toolbox can be 
used to experiment with a variety of finite-state techniques in natural language 
processing, including applications in phonology, morphology and syntax. 
A Syllabification in Optimality Theory

This is the implementation of Karttunen’s formalization of syllabification in 
Optimality Theory. 

%% Karttunen’s X -> L ... R. Every X is ‘bracketed’ with L and R.
macro(dots(X,L,R), [[free(X), [[] x L, X, [] x R]]*, free(X)]).

%% Karttunen’s A => L _ R. Every A must occur in context L _ R.
macro(restrict(A,L,R), ~[? *,A,~[R,? *]] & ~[~[? *,L],A,? *]).
macro(cons, {b,c,d,f,g,h,j,k,l,m,n,p,q,r,s,t,v,w,x,z}).
macro(lbr, {'O[', 'D[', 'X[', 'N['}).
macro(input, {cons,vowel}*).

macro(parse, dots(cons, {'O[','D[','X['}, ']')
           o dots(vowel, {'N[','X['}, ']')).

macro(overparse, [([] x [lbr,']'])^, dots({cons,vowel}, [], [lbr,']']^)]).

macro(onset, ['O[', cons^, ']']).
macro(nucleus, ['N[', vowel^, ']']).
macro(coda, ['D[', cons^, ']']).
macro(unparsed, ['X[', {cons,vowel}, ']']).

macro(syllable_structure, ignore([onset^,nucleus,coda^], unparsed)*).
macro(gen, input o overparse o parse o syllable_structure).
macro(have_ons, restrict('N[', onset, [])).
macro(nocoda, free('D[')).

%% ‘parse’ is used twice in Karttunen 98; we use parsed(N) where N is
%% the maximum number of occurrences of X
macro(parsed(N), free(N, 'X[')).
macro(fillnuc, free(['N[', ']'])).
macro(fillons, free(['O[', ']'])).
:- op(403,yfx,lc).

macro(R lc C, lenient_composition(R,C)).

macro(syllabify, gen lc have_ons lc nocoda lc fillnuc lc parsed(0) lc
      parsed(1) lc parsed(2) lc parsed(3) lc parsed(4) lc fillons).



B Mohri Sproat Replace Operator 

Implementation in FSA5 of the contexted replacement operator of m- 

macro(r(R), reverse(marker(1, [sigma*,reverse(R)], [>]))).

macro(f(F), reverse(marker(1, [{sigma,>}*, reverse([ignore(F,{>}),>])],
                           ['<1','<2']))).

macro(l1(L), sloppy_ignore(marker(2, [sigma*,L], '<1'), {'<2':'<2'})).

macro(l2(L), marker(3, [sigma*,L], '<2')).

macro(replace(Phi,Psi), {{sigma, '<2':'<2', > :[]},
        ['<1':'<1', ignore(Phi,{'<1','<2',>}) x Psi, > :[]]}*).
macro(sigma, ? - {'<1','<2',>}).

rx(marker(Type,Expr,C),Fa) :-
    fsa_regex:rx(identity(determinize(Expr)),Fa0), mark(Type,C,Fa0,Fa).

mark(1,Ins,Fa0,Fa) :-    %% Ins: symbols to be inserted
    fsa_regex:add_symbols(Ins,Fa0,Fa1), fsa_data:symbols(Fa1,Sig),
    fsa_data:start_states(Fa1,Starts), fsa_data:transitions(Fa1,Trs0),
    fsa_data:final_states(Fa1,Fins), fsa_data:all_states(Fa1,AllSts),
    ordsets:ord_subtract(AllSts,Fins,NFins0),
    add_ins(Fins,Ins,NFins,NFins0,Trs,Trs1),
    replace_trs_sf(Trs0,Trs1,Fa0),
    fsa_data:rename_fa(Sig,Starts,NFins,Trs,[],Fa).

replace_trs_sf([],[],_).
replace_trs_sf([trans(A0,B,C)|T0],[trans(A,B,C)|T],Fa) :-
    ( fsa_data:final_state(Fa,A0) -> A=q(A0) ; A=A0 ),
    replace_trs_sf(T0,T,Fa).

add_ins([],_,F,F) --> [].
add_ins([F0|Fs],Ins,[q(F0)|NewF0],NewF) -->
    add_ins0(Ins,F0), add_ins(Fs,Ins,NewF0,NewF).

add_ins0([],_F) --> [].
add_ins0([Sym|Syms],F) --> [trans(F,[]/Sym,q(F))], add_ins0(Syms,F).

mark(2,Del,Fa0,Fa) :-    %% Del is a symbol to be deleted
    fsa_regex:add_symbols([Del],Fa0,Fa1),
    fsa_data:copy_fa_except(transitions,Fa1,Fa2,Trs0,Trs),
    fsa_data:copy_fa_except(final_states,Fa2,Fa,Fins,AllSts),
    fsa_data:all_states(Fa0,AllSts),
    add_deletions(Fins,Del,Trs1,Trs0), sort(Trs1,Trs).

add_deletions([],_) --> [].
add_deletions([F|Fs],Del) --> [trans(F,Del/[],F)], add_deletions(Fs,Del).

mark(3,Del,Fa0,Fa) :-    %% Del is a symbol to be deleted
    fsa_regex:add_symbols([Del],Fa0,Fa1),
    fsa_data:copy_fa_except(transitions,Fa1,Fa2,Trs0,Trs),
    fsa_data:copy_fa_except(final_states,Fa2,Fa,Fins,AllSts),
    fsa_data:all_states(Fa0,AllSts),
    ordsets:ord_subtract(AllSts,Fins,NonFins),
    add_deletions(NonFins,Del,Trs1,Trs0), sort(Trs1,Trs).

%% As defined by Mohri & Sproat. This should be done differently;
%% ignore is not defined for transducers.
macro(sloppy_ignore(A,B), ignore0(A,B)).



C N-Queens Problem 

macro(free(Expr), ~containment(Expr)).

macro(sigma(N), set(L)) :- findall(C, fsa_util:between(1,N,C), L).
macro(columns(N), Ints) :- columns(1,N,Ints).

%% don’t use ordinary operator syntax, since this file is read-in with
%% regular expression operator precedences active.

columns(N,N,free([N,? *,N])).
columns(N0,N,free([N0,? *,N0]) & Ints) :-
    N0<N, is(N1,+(N0,1)), columns(N1,N,Ints).

macro(diagonals(N), I) :- diagonals(1,N,I).

diagonals(N0,N,I) :- is(N,N0+1), !, diagonals_n(1,N0,N,I).
diagonals(N0,N,I0 & I) :- diagonals_n(1,N0,N,I0),
    is(N1,+(N0,1)), diagonals(N1,N,I).

diagonals_n(N0,Br,N,I0) :- is(N,+(N0,Br)), !, diagonal(N0,Br,I0).
diagonals_n(N0,Br,N,I0 & I) :-
    diagonal(N0,Br,I0), is(N1,+(N0,1)), diagonals_n(N1,Br,N,I).

diagonal(N0,Br,free([N0,length(MidN),N])) :-
    is(N,+(N0,Br)), is(MidN,-(Br,1)).
Abstract. We introduce a computing mechanism of a biochemical in-
spiration (similar to a P system from the area of Computing with Mem- 
branes) which consists of a multiset of symbol-objects and a set of finite 
state transducers. The transducers process symbols in the current multi- 
set in the usual manner. A computation starts in an initial configuration 
and ends in a halting configuration. The power of these mechanisms is 
investigated, as well as the closure properties of the obtained family. 
The main results say that (1) systems with two components and an un- 
bounded number of states in each component generate all gsm images of 
all permutation closures of recursively enumerable languages, while (2) 
systems with two states in each component but an unbounded number 
of components can generate the permutation closures of all recursively 
enumerable languages, and (3) the obtained family is a full AFL. Result 
(2) is related to a possible (speculative) implementation of our systems 
in biochemical media. 



1 Introduction 

The present paper can be seen as a contribution both to Natural Computing,
in the area of Computing with Membranes (P systems, see i, |H], m, m,
ca, ca, or the survey in j^), and to Distributed (Parallel) Computing, in the
multiset rewriting area (see Q, ID, la , etc.).

We start from the following speculation concerning a possible biochemical 
implementation of a finite state transducer acting on a multiset. Assume that the 
elements of the multiset are chemical compounds (for instance, DNA sequences) 
which swim in a given solution. A transition rule of the form sa → xs', where
s, s' are states of the transducer, can be realized by a catalyst (for instance, an 
enzyme) which is able to change its state. That is, the catalyst C, in state s, 
assists the compound a to evolve to the set of compounds x and changes the
state to s' (thus, C is a “quasi-catalyst”, having a “memory” of its actions; this 
is very much similar to P systems with bi-stable catalysts, introduced in d, 
where catalysts with two possible states are used). Of course, if the number of 
states of a transducer is large, this idea has no chance of being implemented, so
it is natural to consider only transducers with a small number
of states. Transducers with a small number of states are weak, so the suggestion
arises to let several such transducers cooperate in a well-defined manner.

In this way, we are led to the following kind of a “biochemical” computing 
mechanism. In a given space (a membrane) we have a multiset of objects, iden- 
tified with symbols from a given alphabet. In the same space, we place several 
finite state transducers (generalized sequential machines) . In a parallel manner, 
these transducers take symbols available around and, depending on their states, 
produce new symbols and change their states. In this way, a new configuration of 
the system is obtained. A sequence of such transitions among configurations is
a computation; a computation is complete if it halts, that is, no further move is
possible in its last configuration. In this way, a mapping from the initial multiset 
of objects to the multiset present in the halting configuration is defined. We can 
also associate a set of strings with a computation, as in P systems: we distinguish 
a terminal set of symbols and construct the string of terminal symbols appearing 
during the computation, in the order they are produced; when several terminal 
symbols are introduced at the same time, then any ordering of them is accepted 
(thus, several strings are associated with the same computation). 

Consequently, we have here a variant of a P system, with only one membrane 
(so, a particular case from this point of view), but with the evolution rules of a 
powerful form: finite state machines, which remember by their states some infor- 
mation about their previous work. Still, such machines are among the simplest 
we can consider. (This could make appropriate the term colony for our device, 
in the sense introduced in |^, of a collectivity of as simple as possible devices 
working together.) 

Somewhat expected (this happens in general in distributed systems, P sys- 
tems included), the power of our computing machinery is rather large: systems 
with only two components are able to generate all recursively enumerable lan- 
guages modulo a permutation; even the gsm images of permutation closures of 
recursively enumerable languages can be obtained in this way. If we leave the
number of components free, then systems with only two states in each component
are sufficient in order to generate all permutation closures of recursively enu- 
merable languages. This is a good result from the point of view of the possible 
(but never practically tried. . . ) biochemical implementation: bi-stable catalysts 
suffice. Moreover, the family of languages generated by our devices is a full AFL 
(Abstract Family of Languages), so we can appreciate it as being very large. 
2 Prerequisites

For the elements of formal language theory used below we refer to the
literature. We only specify some notation.

For an alphabet $V$, $V^*$ is the free monoid generated by $V$, $\lambda$ is the empty
string, and $V^+ = V^* - \{\lambda\}$. The length of $x \in V^*$ is denoted by $|x|$, while
$|x|_a$ is the number of occurrences of the symbol $a$ in the string $x$. The set of
subwords of $x \in V^*$ is denoted by $Sub(x)$. If $V = \{a_1, \ldots, a_n\}$ (the ordering is
important), then $\Psi_V(x) = (|x|_{a_1}, \ldots, |x|_{a_n})$ is the Parikh vector of the string
$x \in V^*$. For a language $L \subseteq V^*$, $\Psi_V(L) = \{\Psi_V(x) \mid x \in L\}$ is the Parikh set of $L$
and $p(L)$ is the permutation closure of $L$.
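
The notions of Parikh vector and permutation closure can be illustrated for finite languages by the following minimal sketch. The function names are hypothetical and chosen only for this illustration; symbols are assumed to be single characters.

```python
from itertools import permutations


def parikh_vector(x, alphabet):
    """Return (|x|_{a_1}, ..., |x|_{a_n}) for the ordered alphabet a_1, ..., a_n."""
    return tuple(x.count(a) for a in alphabet)


def parikh_set(language, alphabet):
    """Parikh set of a finite language: the set of Parikh vectors of its strings."""
    return {parikh_vector(x, alphabet) for x in language}


def permutation_closure(language):
    """p(L) for a finite language L: every reordering of every string in L."""
    return {"".join(p) for x in language for p in permutations(x)}
```

For instance, `permutation_closure({"ab"})` contains both `"ab"` and `"ba"`, and every string of a permutation closure has the same Parikh vector as the string it was obtained from.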

A Chomsky grammar is written in the form $G = (N, T, S, P)$, where $N$ is
the nonterminal alphabet, T is the terminal alphabet, S is the axiom, and P is 
the set of productions. We denote by RE the family of recursively enumerable 
languages. 

In general, for a family $FL$ of languages, we denote by $pFL$, $FL^{one}$, and
$FL^{bound}$ the families of permutation closures of languages in $FL$, of languages
in $FL$ over the one-letter alphabet, and of the strictly bounded languages in
$FL$, respectively (a language $L \subseteq V^*$ is strictly bounded if there are $n$ different
symbols $a_1, \ldots, a_n \in V$ such that $L \subseteq a_1^* \ldots a_n^*$).

A notion which will be very useful below is that of a matrix grammar. Such a
grammar is a construct $G = (N, T, S, M, C)$, where $N, T$ are disjoint alphabets,
$S \in N$, $M$ is a finite set of sequences of the form $(A_1 \to x_1, \ldots, A_n \to x_n)$,
$n \ge 1$, of context-free rules over $N \cup T$ (with $A_i \in N$, $x_i \in (N \cup T)^*$, in all cases),
and $C$ is a set of occurrences of rules in $M$ ($N$ is the nonterminal alphabet, $T$
is the terminal alphabet, $S$ is the axiom, while the elements of $M$ are called
matrices).

For $w, z \in (N \cup T)^*$ we write $w \Rightarrow z$ if there is a matrix $(A_1 \to x_1, \ldots, A_n \to x_n)$
in $M$ and strings $w_i \in (N \cup T)^*$, $1 \le i \le n+1$, such that $w = w_1$, $z = w_{n+1}$,
and, for all $1 \le i \le n$, either $w_i = w_i' A_i w_i''$, $w_{i+1} = w_i' x_i w_i''$, for some
$w_i', w_i'' \in (N \cup T)^*$, or $w_i = w_{i+1}$, $A_i$ does not appear in $w_i$, and the rule
$A_i \to x_i$ appears in $C$. (The rules of a matrix are applied in order, possibly
skipping the rules in $C$ if they cannot be applied; we say that these rules are
applied in the appearance checking mode.) If $C = \emptyset$, then the grammar is said
to be without appearance checking (and $C$ is no longer mentioned).

We denote by $\Rightarrow^*$ the reflexive and transitive closure of the relation $\Rightarrow$.
The language generated by $G$ is defined by $L(G) = \{w \in T^* \mid S \Rightarrow^* w\}$.
The family of languages of this form is denoted by $MAT_{ac}$. When we use only
grammars without appearance checking, then the obtained family is denoted by
$MAT$.
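
The derivation relation of a matrix grammar with appearance checking can be sketched as follows. This is only an illustration under simplifying assumptions: every symbol is a single character, a rule $A \to x$ is the pair `(A, x)`, and the set $C$ is modelled as a set of such pairs rather than as a set of rule occurrences. The name `apply_matrix` is hypothetical.

```python
def apply_matrix(w, matrix, ac_rules=frozenset()):
    """Return all strings z with w => z by one application of `matrix`.

    matrix   -- sequence of rules (A, x), applied in the given order
    ac_rules -- rules (A, x) that may be skipped when A does not occur
    """
    current = {w}
    for (A, x) in matrix:
        nxt = set()
        for u in current:
            positions = [i for i, ch in enumerate(u) if ch == A]
            for i in positions:
                # Rewrite one occurrence of A; every occurrence is a valid choice.
                nxt.add(u[:i] + x + u[i + 1:])
            if not positions and (A, x) in ac_rules:
                # Appearance checking: A is absent, so the rule is skipped.
                nxt.add(u)
        current = nxt
    return current
```

With `ac_rules` empty, a matrix whose rule cannot be applied yields no successor at all, which corresponds to grammars without appearance checking.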

A matrix grammar $G = (N, T, S, M, C)$ is said to be in the binary normal
form if $N = N_1 \cup N_2 \cup \{S, \dagger\}$, with these three sets mutually disjoint, and the
matrices in $M$ are of one of the following forms:

1. $(S \to XA)$, with $X \in N_1$, $A \in N_2$,

2. $(X \to Y, A \to x)$, with $X, Y \in N_1$, $A \in N_2$, $x \in (N_2 \cup T)^*$,

3. $(X \to Y, A \to \dagger)$, with $X, Y \in N_1$, $A \in N_2$,

4. $(X \to \lambda, A \to x)$, with $X \in N_1$, $A \in N_2$, and $x \in T^*$.

Moreover, there is only one matrix of type 1 and $C$ consists exactly of all rules
$A \to \dagger$ appearing in matrices of type 3. One sees that $\dagger$ is a trap-symbol; once
introduced, it is never removed. A matrix of type 4 is used only once, at the last
step of a derivation.

According to Lemma 1.3.7 in 0, for each matrix grammar there is an equiv- 
alent matrix grammar in the binary normal form. 

A multiset over an alphabet $V = \{a_1, \ldots, a_n\}$ is a mapping $\mu : V \to \mathbf{N} \cup \{\infty\}$.
A multiset can be given in the form $\{(a_1, \mu(a_1)), \ldots, (a_n, \mu(a_n))\}$ or
can be represented by any string $w \in V^*$ such that $\Psi_V(w) = (\mu(a_1), \ldots, \mu(a_n))$.

We shall make extensive use below of the string representation of a multiset.
For the sake of mathematical accuracy, the multiset $\{(a_1, |w|_{a_1}), \ldots, (a_n, |w|_{a_n})\}$
represented by a string $w \in V^*$ is denoted by $\mu(w)$.

For $a_i \in V$ and a multiset $\mu$ over $V$, we say that $a_i$ belongs to $\mu$ and we write
$a_i \in \mu$ if $\mu(a_i) \ge 1$.

We say that the multiset $\mu_1$ is included in the multiset $\mu_2$, and write $\mu_1 \subseteq \mu_2$,
if $\mu_1(a) \le \mu_2(a)$ for all $a \in V$. The union of $\mu_1, \mu_2$ is the multiset defined by
$(\mu_1 \cup \mu_2)(a) = \mu_1(a) + \mu_2(a)$, for all $a \in V$. The difference of two multisets,
$\mu_1 - \mu_2$, is defined here only when $\mu_2 \subseteq \mu_1$, by $(\mu_1 - \mu_2)(a) = \mu_1(a) - \mu_2(a)$, for
all $a \in V$.
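
The multiset operations above map directly onto Python's `collections.Counter`, as the following sketch shows. It is only an illustration: infinite multiplicities are not modelled, and the helper names `mu`, `included`, `union` and `difference` are chosen here for readability.

```python
from collections import Counter


def mu(w):
    """Multiset represented by the string w (symbols are single characters)."""
    return Counter(w)


def included(m1, m2):
    """m1 is included in m2 if m1(a) <= m2(a) for every symbol a."""
    return all(m2[a] >= n for a, n in m1.items())


def union(m1, m2):
    """Multiset union: multiplicities are added."""
    return m1 + m2


def difference(m1, m2):
    """m1 - m2, defined only when m2 is included in m1."""
    if not included(m2, m1):
        raise ValueError("difference is defined only when m2 is included in m1")
    return m1 - m2
```

Two strings represent the same multiset exactly when they are permutations of each other, so `mu("aab") == mu("aba")` holds.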

3 P Systems of Transducers 

We now introduce the computing mechanisms we investigate in this paper. 

Let us first recall that a gsm (generalized sequential machine) is a construct
$\gamma = (K, V_1, V_2, s_0, F, P)$, where $K, V_1, V_2$ are alphabets (the set of states, the
input and output alphabets, respectively), $s_0 \in K$ (initial state), $F \subseteq K$ (final
states), and $P$ is a finite set of rewriting rules of the form $sa \to xs'$, for $s, s' \in K$
and $a \in V_1$, $x \in V_2^*$.

For $s, s' \in K$, $y_1, x \in V_2^*$, $y_2 \in V_1^*$, $a \in V_1$ we write $y_1 s a y_2 \Rightarrow y_1 x s' y_2$ if
$sa \to xs' \in P$. Then, for $w \in V_1^*$ we define $\gamma(w) = \{z \in V_2^* \mid s_0 w \Rightarrow^* z s_f$ for
some $s_f \in F\}$. For $L \subseteq V_1^*$, we define $\gamma(L) = \bigcup_{w \in L} \gamma(w)$,
the image of $L$ by the gsm $\gamma$.

For a family FL of languages, we denote by gsm{FL) the family of gsm 
images of languages in FL. 
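
As an illustration of the definition of $\gamma(w)$, the following sketch computes the image of a single string under a (possibly nondeterministic) gsm. The encoding of the rule set as a dictionary and the name `gsm_image` are assumptions made only for this example.

```python
def gsm_image(rules, start, finals, w):
    """Return gamma(w) for a gsm given by its rules, initial and final states.

    rules -- dict mapping (state, input_symbol) to a set of pairs
             (output_string, next_state), one pair per rule s a -> x s'
    """
    # Each partial run is a pair (output produced so far, current state).
    configs = {("", start)}
    for a in w:
        configs = {
            (out + x, s2)
            for (out, s) in configs
            for (x, s2) in rules.get((s, a), ())
        }
    # Only runs ending in a final state contribute to the image.
    return {out for (out, s) in configs if s in finals}
```

For example, a one-state gsm with rules $sa \to aas$ and $sb \to s$ maps `"aba"` to `"aaaa"`.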

Here we consider the gsm’s not as operating on strings (and languages), but 
as operators on multisets of symbols; in this case, the final states are no longer 
necessary. 

A system of transducers (recalling Membrane Computing, we say, for short,
a PT system) of degree $n$, $n \ge 1$, is a construct

$\Pi = (V, T, w_0, \gamma_1, \ldots, \gamma_n)$,

where:

- $V$ is an alphabet (its elements are called objects),

- $T \subseteq V$ (the terminal alphabet),

- $w_0 \in (V - T)^*$ represents a multiset over $V$ (the initial multiset),

- $\gamma_i = (K_i, V, s_{0,i}, P_i)$, $1 \le i \le n$, where $K_i$ is a finite alphabet (set of states),
$s_{0,i} \in K_i$, and $P_i$ is a finite set of transition rules of the form $sa \to xs'$, for
$a \in V - T$, $x \in V^*$, $s, s' \in K_i$ (thus, each $\gamma_i$ is a gsm without final states and
with identical input and output alphabets, namely equal to $V$; we say that
$\gamma_i$ is a component of $\Pi$).

Note that $w_0$ is a string over $V - T$ representing a multiset and that the
terminal symbols cannot be processed by the rules in the components of $\Pi$.

Any $(n+1)$-tuple $(w, s_1, \ldots, s_n)$, with $w \in V^*$, $s_i \in K_i$, $1 \le i \le n$, is called
a configuration of $\Pi$; $(w_0, s_{0,1}, \ldots, s_{0,n})$ is the initial configuration of $\Pi$.

For two configurations $(w, s_1, \ldots, s_n)$, $(w', s_1', \ldots, s_n')$ we write
$(w, s_1, \ldots, s_n) \Rightarrow (w', s_1', \ldots, s_n')$ (and we say that we have a transition between
the two configurations) if the following conditions hold:

1. there is $k \ge 1$ and there are indices $i_1, \ldots, i_k \in \{1, 2, \ldots, n\}$ such that

- $\mu(a_{i_1} \ldots a_{i_k}) \subseteq \mu(w)$, for $a_{i_j} \in V$, $1 \le j \le k$,

- $s_{i_j} a_{i_j} \to x_{i_j} s_{i_j}' \in P_{i_j}$, for $1 \le j \le k$,

- $\mu(w') = (\mu(w) - \mu(a_{i_1} \ldots a_{i_k})) \cup (\mu(x_{i_1}) \cup \mu(x_{i_2}) \cup \ldots \cup \mu(x_{i_k}))$,

- for $l \in \{1, 2, \ldots, n\} - \{i_1, \ldots, i_k\}$ we have $s_l = s_l'$;

2. the set $\{i_1, \ldots, i_k\}$ is maximal, in the sense that there is no transition
$s_r a_r \to x_r s_r' \in P_r$ for some $r \in \{1, 2, \ldots, n\} - \{i_1, \ldots, i_k\}$,
such that $a_r \in \mu(w) - \mu(a_{i_1} \ldots a_{i_k})$
(no further object in the multiset $\mu(w)$ can be processed by a gsm different
from those mentioned at the previous point, $\gamma_{i_1}, \ldots, \gamma_{i_k}$).

In plain words, each gsm which can use a transition rule must do so; a gsm
for which this is not possible remains in the same state.
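
The transition relation can be made concrete by enumerating all successors of a configuration. The sketch below is one possible reading of the two conditions (distinct components each consume one available symbol, and the set of working components is maximal); the data layout and the name `successors` are assumptions introduced for this illustration only.

```python
from collections import Counter
from itertools import product


def successors(w, states, components):
    """Enumerate the configurations reachable in one transition.

    w          -- Counter, the current multiset of symbols
    states     -- tuple with the current state of each component
    components -- list of dicts mapping (state, symbol) to a list of
                  pairs (output_string, next_state)
    Returns a set of pairs (sorted string representing w', new states).
    """
    # Each component either idles (None) or picks one rule applicable in
    # its current state: (consumed symbol, output string, next state).
    options = []
    for s, rules in zip(states, components):
        opts = [None]
        for (q, a), targets in rules.items():
            if q == s and w[a] > 0:
                opts.extend((a, x, q2) for (x, q2) in targets)
        options.append(opts)

    result = set()
    for choice in product(*options):
        used = Counter(c[0] for c in choice if c is not None)
        if not used:
            continue  # k >= 1: at least one component must work
        if any(used[a] > w[a] for a in used):
            continue  # the consumed symbols must be available together
        rest = w - used
        # Maximality: no idle component can process a remaining symbol.
        maximal = all(
            not any(q == s and rest[a] > 0 for (q, a) in rules)
            for c, s, rules in zip(choice, states, components)
            if c is None
        )
        if not maximal:
            continue
        new_w = rest + Counter("".join(c[1] for c in choice if c is not None))
        new_states = tuple(s if c is None else c[2] for c, s in zip(choice, states))
        result.add(("".join(sorted(new_w.elements())), new_states))
    return result
```

Under this reading, a configuration is halting exactly when `successors` returns the empty set.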

A configuration $(w, s_1, \ldots, s_n)$ is said to be a halting one if there is no
configuration $(w', s_1', \ldots, s_n')$ such that a transition
$(w, s_1, \ldots, s_n) \Rightarrow (w', s_1', \ldots, s_n')$ is possible.

As usual, we denote by $\Rightarrow^*$ the reflexive and transitive closure of the relation
$\Rightarrow$. A sequence of transitions is called a (complete) computation if it starts in
the initial configuration and ends in a halting configuration.

There are several possibilities of associating a result with a computation. We 
choose the variant also followed in the P systems area, see |S|, 0: we collect 
the terminal symbols, in the order in which they are introduced, and form a 
string; if several terminal symbols are introduced at the same time (by the same 
component of $\Pi$, using a rule $sa \to xs'$ with $x$ being a string, or by several
components), then all the orderings of those symbols are allowed, hence a set of
strings is associated with the same computation. The set of strings of this form
is the language generated by $\Pi$ and it is denoted by $L(\Pi)$.

Formally, this language is defined as follows. For two strings $w, w' \in V^*$ such
that $\Psi_T(w) \le \Psi_T(w')$ (componentwise), we denote by $L(w' - w)$ the set of words
$x \in T^*$ such that $\Psi_T(w') = \Psi_T(w) + \Psi_T(x)$ (the set of strings over $T$ composed
of symbols which appear in $w'$ and not in $w$). Then, for a halting computation

$\Delta : (w_0, s_{0,1}, \ldots, s_{0,n}) \Rightarrow (w_1, s_{1,1}, \ldots, s_{1,n}) \Rightarrow \ldots \Rightarrow (w_m, s_{m,1}, \ldots, s_{m,n})$

we define

$L(\Delta) = L(w_1 - w_0) L(w_2 - w_1) \ldots L(w_m - w_{m-1})$.

The language $L(\Pi)$ is the union of all languages $L(\Delta)$, for $\Delta$ being a halting
computation with respect to $\Pi$.
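
The definition of $L(\Delta)$ can be sketched for a computation given by its sequence of multisets $w_0, \ldots, w_m$ (as strings). The function names are hypothetical; terminals are assumed to be single characters and, as in the definition, never consumed.

```python
from itertools import permutations


def new_terminals(w_prev, w_next, T):
    """L(w' - w): all orderings of the terminal symbols gained in one step."""
    gained = []
    for a in sorted(T):
        # Number of new occurrences of the terminal a introduced by this step.
        gained.extend(a * (w_next.count(a) - w_prev.count(a)))
    return {"".join(p) for p in permutations(gained)}


def computation_language(ws, T):
    """L(Delta): concatenation of L(w_{i+1} - w_i) along the computation."""
    result = {""}
    for w_prev, w_next in zip(ws, ws[1:]):
        step = new_terminals(w_prev, w_next, T)
        result = {u + v for u in result for v in step}
    return result
```

If a step introduces both $a$ and $b$, both orders $ab$ and $ba$ are produced, while symbols introduced in different steps keep their temporal order.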

We denote by $PTL_n$ the family of languages generated by PT systems of
degree less than or equal to $n$, $n \ge 1$; the union of all these families is denoted
by $PTL$.

Directly from the definitions, we get:

Lemma 1. $PTL_n \subseteq PTL_{n+1}$, $n \ge 1$.

4 An Example 

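In order to illustrate the definition and the work of a PT system, let us consider
the system (of degree 3)

$\Pi = (V, T, w_0, \gamma_1, \gamma_2, \gamma_3)$, with
$V = \{a, a', a'', \bar{a}, b, c, d, e, f, g, h\}$,
$T = \{a\}$,
$w_0 = a'a'b^n d$, for some $n \ge 1$,

and the following components:

$\gamma_1 = (\{s_{0,1}, s_{1,1}\}, V, s_{0,1}, P_1)$,
$P_1 = \{s_{0,1} b \to c s_{1,1},\ s_{1,1} h \to s_{0,1}\}$,

$\gamma_2 = (\{s_{0,2}, s_{1,2}, s_{2,2}, s_{3,2}, s_{4,2}\}, V, s_{0,2}, P_2)$,
$P_2 = \{s_{0,2} a' \to a s_{4,2},\ s_{4,2} a' \to a s_{4,2},\ s_{4,2} b \to b s_{4,2},\ s_{4,2} c \to c s_{4,2},$
$s_{0,2} d \to d s_{0,2},\ s_{0,2} c \to s_{1,2},\ s_{1,2} a' \to a''a'' s_{1,2},\ s_{1,2} e \to f s_{2,2},$
$s_{2,2} g \to s_{3,2},\ s_{3,2} a'' \to a' s_{3,2},\ s_{3,2} e \to f s_{0,2}\}$,

$\gamma_3 = (\{s_{0,3}, s_{1,3}, s_{2,3}, s_{3,3}, s_{4,3}\}, V, s_{0,3}, P_3)$,
$P_3 = \{s_{0,3} d \to d s_{0,3},\ s_{0,3} d \to ed s_{1,3},\ s_{1,3} a' \to \bar{a} s_{2,3},\ s_{2,3} \bar{a} \to \bar{a} s_{2,3},$
$s_{1,3} f \to g s_{3,3},\ s_{3,3} d \to d s_{3,3},\ s_{3,3} d \to ed s_{4,3},\ s_{4,3} a'' \to \bar{a} s_{2,3},$
$s_{4,3} f \to h s_{0,3}\}$.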

The components of this PT system are represented graphically in Figure 1. 

The system works as follows (and halts in a configuration which contains $2^n$
copies of the symbol $a$).

If in state $s_{0,2}$ the component $\gamma_2$ chooses to go to state $s_{4,2}$, then we never
come back to $s_{0,2}$. Assume that at the first step $\gamma_2$ uses the rule $s_{0,2} d \to d s_{0,2}$.
Simultaneously, $\gamma_1$ transforms one occurrence of $b$ into $c$ and $\gamma_3$ remains in the
initial state (the occurrence of $d$ is used by $\gamma_2$). At the next step, $\gamma_2$ can pass to
$s_{1,2}$, by the rule $s_{0,2} c \to s_{1,2}$, making use of the symbol $c$ introduced by the first
component; $\gamma_1$ will wait in state $s_{1,1}$ until a symbol $h$ is produced.

In state $s_{1,2}$ we can replace each $a'$ by two copies of $a''$. Assume that at
this time $\gamma_3$ remains in state $s_{0,3}$. The component $\gamma_2$ cannot leave $s_{1,2}$ before
a copy of $e$ has been produced in $\gamma_3$. At any moment, $\gamma_3$ can use the rule
$s_{0,3} d \to ed s_{1,3}$. If in the current multiset we still have occurrences of $a'$, then $\gamma_3$
will now introduce $\bar{a}$, passing to state $s_{2,3}$ (this is obligatory, because the symbol
$f$ is not available). The computation will never stop, because of the rule
$s_{2,3} \bar{a} \to \bar{a} s_{2,3}$, which can be used forever. No output is obtained in such a case.
Therefore, the rule $s_{0,3} d \to ed s_{1,3}$ in $\gamma_3$ should be used after transforming all
symbols $a'$ into $a''$ (doubling them).






After using $s_{1,2} e \to f s_{2,2}$ in $\gamma_2$, we can use $s_{1,3} f \to g s_{3,3}$ in $\gamma_3$. At the next
step, $\gamma_2$ can use $g$ in order to pass to state $s_{3,2}$ while $\gamma_3$ stays in $s_{3,3}$ any number
of steps. During this time, $\gamma_2$ replaces each $a''$ by $a'$ (in state $s_{3,2}$). Again we can
control whether or not this process is complete, by means of the symbols $d, e, \bar{a}$: if
$\gamma_3$ uses the rule $s_{3,3} d \to ed s_{4,3}$ while symbols $a''$ are still present, then at the next
step $\gamma_3$ will introduce the trap-symbol $\bar{a}$ (we cannot use the rule $s_{4,3} f \to h s_{0,3}$,
because no symbol $f$ is available).

If all symbols $a''$ were replaced by $a'$, then $\gamma_3$ introduces the symbol $h$ and
returns to its initial state; note that $\gamma_2$ is already in the initial state. Using the
symbol $h$, also $\gamma_1$ can return to its initial state. In the current multiset, the
number of occurrences of $a'$ is doubled in comparison with the number of such
symbols in the previous configuration. During this time, a copy of $b$ has been
transformed into $c$.

When $\gamma_2$ enters the state $s_{4,2}$, we have two possibilities. If any copy of $b$ or
of $c$ is present, then the computation will continue forever. If no copy of $b$ or
of $c$ is present, then the computation can continue only until all symbols $a'$ are
replaced by $a$ and then it stops: the component $\gamma_3$ can proceed further only using
occurrences of the symbol $f$ and such symbols are produced only by $\gamma_2$, which
is no longer able to introduce $f$.

In conclusion, we can double $n$ times the number of occurrences of $a'$, that is,
we stop in a configuration which contains $2^n$ copies of $a$. We may say that the
system $\Pi$ above computes the function $f(n) = 2^n$, $n \ge 1$.

One can modify this system in order to generate the language $\{a^{2^n} \mid n \ge 1\}$,
but we leave this task to the reader.

5 The Power of PT Systems 

We pass now to investigating the generative power of PT systems. The main 
result in this sense is the next one, showing that our mechanisms are very pow- 
erful. 

Theorem 2. $gsm(pRE) \subset PTL_2$, strict inclusion.

Proof. (1) Let us first prove the inclusion. 

It is known that $RE = MAT_{ac}$; this implies that $pRE = pMAT_{ac}$. Consider
a matrix grammar with appearance checking $G = (N, T, S, M, C)$ and a gsm
$\gamma = (K, T, V_2, q_0, F, P)$. Assume that $G$ is in the binary normal form (hence
$N = N_1 \cup N_2 \cup \{S, \dagger\}$) and that it contains $k$ matrices of the form
$m_i : (X_i \to \alpha_i, A_i \to x_i)$, $1 \le i \le k$, for some $X_i \in N_1$, $\alpha_i \in N_1 \cup \{\lambda\}$,
$A_i \in N_2$, $x_i \in (N_2 \cup T)^*$, and $n$ matrices of the form
$m_j : (X_j \to Y_j, A_j \to \dagger)$, $k+1 \le j \le k+n$, for some $X_j, Y_j \in N_1$, $A_j \in N_2$.

For a string $x \in (N_2 \cup T)^*$ we denote by $\bar{x}$ the string obtained by replacing
each terminal symbol $a$ which appears in $x$ by $\bar{a}$ (the nonterminal symbols remain
unchanged).

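We construct the PT system (of degree 2)

$\Pi = (V, V_2, w_0, \gamma_1, \gamma_2)$,
$V = N_1 \cup N_2 \cup V_2 \cup \{c, d, e, \dagger\} \cup \{\bar{a} \mid a \in T\}$,
$w_0 = XAc$, for $(S \to XA) \in M$, $X \in N_1$, $A \in N_2$,

with the following components:

$\gamma_1 = (K_1, V, s_{0,1}, P_1)$,
$K_1 = K \cup \{q_f\} \cup \{s_{i,1} \mid 0 \le i \le k+n\}$
$\cup \{[q, r, i] \mid r : qa \to zq' \in P,\ q, q' \in K,\ a \in T,\ z \in V_2^*,\ 1 \le i \le |z|,\ |z| \ge 2\}$,
$P_1 = \{s_{0,1} X_i \to s_{i,1},\ s_{i,1} A_i \to \alpha_i \bar{x}_i s_{0,1},\ s_{i,1} c \to \dagger s_{i,1} \mid 1 \le i \le k\}$
$\cup \{s_{0,1} X_j \to e s_{j,1},\ s_{j,1} A_j \to \dagger s_{j,1},\ s_{j,1} d \to Y_j s_{0,1} \mid k+1 \le j \le k+n\}$
$\cup \{s_{0,1} c \to c q_0\}$
$\cup \{qc \to cq \mid q \in K\}$
$\cup \{q\bar{a} \to \alpha q' \mid qa \to \alpha q' \in P,\ \alpha \in V_2 \cup \{\lambda\}\}$
$\cup \{q\bar{a} \to \alpha_1 [q, r, 1],\ [q, r, 1]c \to c\alpha_2 [q, r, 2],\ \ldots,$
$[q, r, t-2]c \to c\alpha_{t-1}[q, r, t-1],\ [q, r, t-1]c \to c\alpha_t q' \mid$
for $r : qa \to zq' \in P$, $z = \alpha_1 \alpha_2 \ldots \alpha_t$, $t \ge 2$, $\alpha_i \in V_2$, $1 \le i \le t\}$
$\cup \{qc \to cq_f \mid q \in F\}$
$\cup \{q_f \alpha \to \alpha q_f \mid \alpha \in N_1 \cup N_2 \cup \{\bar{a} \mid a \in T\} \cup \{\dagger\}\}$,

$\gamma_2 = (\{s_{0,2}\}, V, s_{0,2}, P_2)$,
$P_2 = \{s_{0,2} e \to d s_{0,2}\} \cup \{s_{0,2} \alpha \to \alpha s_{0,2} \mid \alpha \in N_1 \cup N_2 \cup \{\dagger\}\}$.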

Let us examine the work of this system. 

We start from the multiset represented by $XAc$; as long as a nonterminal
symbol of $G$ is present, the component $\gamma_2$ cannot stop. Therefore, we can halt
only when no nonterminal symbol is present.

If $\gamma_1$ moves from $s_{0,1}$ to $q_0$ (the initial state of the gsm $\gamma$), then it never
returns to $s_{0,1}$. From $q_0$, we simulate the work of $\gamma$, in the following way.

First, note that in each state $q \in K$ we can work forever using the rule
$qc \to cq$. Thus, we have to reach the state $q_f$. Also in this state we can work
forever if a nonterminal symbol of $G$ is present or any symbol of the form $\bar{a}$,
for $a \in T$, is present. Such barred symbols are introduced by the rules which
simulate matrices in $M$ (see below). Consequently, after entering the state $q_0$,
we can finish the work of $\gamma_1$ only if no nonterminal is present and all terminals
which are present (in the barred form) are correctly parsed by the gsm $\gamma$.

The parsing through $\gamma$ can be simulated at any time; in particular, we can do
that after eliminating all the nonterminals (this means that a derivation in $G$ is
completely simulated, see below). This is ensured by the fact that in each state
$q \in K$ we can wait as much as we need, by using the rule $qc \to cq$. In this way,
we have at our disposal all the terminal symbols, hence we can process them
in any order we want. Otherwise stated, any permutation of a string generated
by $G$ is available and we can translate it. Note also the important fact that the
rules of the form $q\bar{a} \to \alpha_1[q, r, 1]$, $[q, r, 1]c \to c\alpha_2[q, r, 2]$, $\ldots$, $[q, r, t-1]c \to c\alpha_t q'$,
corresponding to a rule $r : qa \to \alpha_1 \ldots \alpha_t q'$ from $P$, introduce the symbols $\alpha_1, \ldots,$
$\alpha_t$ one by one, hence in the order imposed by the rule $r$ (if all these symbols were
produced at the same time, then any permutation of them would have to be
considered, which is not correct).
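
The way a gsm rule $r : qa \to zq'$ with $|z| \ge 2$ is spread over several steps can be sketched as follows. The rule encoding (pairs of tuples) and the name `split_long_rule` are assumptions of this illustration; the intermediate states $[q, r, i]$ are represented as tuples `(q, label, i)` and the waiting symbol $c$ is passed as `clock`.

```python
def split_long_rule(q, a_bar, z, q_next, label, clock="c"):
    """Rules simulating r : q a -> z q' (|z| >= 2), one output symbol per step.

    Each returned rule has the form ((state, input), (output, next_state)).
    """
    t = len(z)
    if t < 2:
        raise ValueError("only rules with |z| >= 2 need to be split")
    # First step: consume the barred terminal and emit the first output symbol.
    rules = [((q, a_bar), (z[0], (q, label, 1)))]
    # Middle steps: recycle the symbol c and emit the next output symbol.
    for i in range(1, t - 1):
        rules.append((((q, label, i), clock), (clock + z[i], (q, label, i + 1))))
    # Last step: emit the final output symbol and enter the target state q'.
    rules.append((((q, label, t - 1), clock), (clock + z[t - 1], q_next)))
    return rules
```

Since each rule emits exactly one symbol of $z$ (together with the recycled $c$), the output symbols appear in consecutive steps and no unwanted permutation of $z$ is generated.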

What remains is to show that each derivation with respect to $G$ can be
simulated in $\Pi$.




Assume that we are in a configuration $(w, s_{0,1}, s_{0,2})$ (initially, $w = XAc$).

If we use a rule $s_{0,1} X_i \to s_{i,1}$ in $\gamma_1$, for $m_i : (X_i \to \alpha_i, A_i \to x_i)$ in $M$,
then at the next step we have to use the rule $s_{i,1} A_i \to \alpha_i \bar{x}_i s_{0,1}$. Indeed, if we
introduce the symbol $\dagger$, by using the rule $s_{i,1} c \to \dagger s_{i,1}$, then the computation
never stops. This means that the use of the matrix $m_i$ is correctly simulated,
both its rules were used. The process can be iterated.

If in the configuration $(w, s_{0,1}, s_{0,2})$ we use a rule $s_{0,1} X_j \to e s_{j,1}$ for some
matrix $m_j : (X_j \to Y_j, A_j \to \dagger)$, $k+1 \le j \le k+n$, from $M$, then in state $s_{j,1}$
we have two possibilities.

If the symbol $A_j$ is present in the current configuration, then we have to use
the rule $s_{j,1} A_j \to \dagger s_{j,1}$ and the computation will never finish. If the symbol $A_j$ is
not present, then $\gamma_1$ cannot work, we remain in state $s_{j,1}$ until the component $\gamma_2$
uses the rule $s_{0,2} e \to d s_{0,2}$ (the symbol $e$ is now available). Using the symbol $d$
introduced by $\gamma_2$, $\gamma_1$ returns to its initial state and the symbol $Y_j$ is introduced.
One can see that again we simulate correctly a matrix in $M$, namely one with a
rule used in the appearance checking manner. The process can be iterated.

In all cases, we get computations which can halt only when we correctly
simulate the matrices of $G$. As we have seen above, when the derivation in $G$
which is simulated by $\Pi$ is terminal, and only in this case, we can also terminate
the computation, reaching a halting configuration. In conclusion, $L(\Pi)$ consists
exactly of all strings $w \in \gamma(z)$, for $z$ being a permutation of a string in $L(G)$.
Therefore, $L(\Pi) = \gamma(p(L(G)))$, hence $gsm(pRE) \subseteq PTL_2$.

(2) This is a proper inclusion. Indeed, let us consider the language

$L = \{a b a^2 b a^{2^2} b \ldots b a^{2^m} \mid m \ge 1\}$.

This language can be generated by the matrix grammar
$G = (\{S, A, A', X, Y, Z, \dagger\}, \{a, b\}, S, M, C)$, with the following matrices:

0. $(S \to XA)$,

1. $(X \to X, A \to aA'A')$,

2. $(X \to bY, A \to \dagger)$,

3. $(Y \to Y, A' \to A)$,

4. $(Y \to X, A' \to \dagger)$,

5. $(Y \to bZ, A' \to \dagger)$,

6. $(Z \to Z, A \to a)$,

7. $(Z \to \lambda)$.

One can see that the blocks $a^{2^i}$ (by a "block" we understand a maximal subword
consisting of occurrences of $a$) from the strings of $L$ are produced in sequence,
from left to right.

As in the first part of the proof, we can construct a PT system which
simulates the work of $G$; because we use no gsm, the construction should be slightly
modified. We give only the transitions of the two transducers, the states and the
alphabets are obvious:

$P_1 = \{s_{0,1} X \to s_{1,1},\ s_{1,1} A \to XaA'A' s_{0,1},\ s_{1,1} c \to \dagger s_{1,1},$
$s_{0,1} X \to e s_{2,1},\ s_{2,1} A \to \dagger s_{2,1},\ s_{2,1} d \to bY s_{0,1},$
$s_{0,1} Y \to s_{3,1},\ s_{3,1} A' \to YA s_{0,1},\ s_{3,1} c \to \dagger s_{3,1},$
$s_{0,1} Y \to e s_{4,1},\ s_{4,1} A' \to \dagger s_{4,1},\ s_{4,1} d \to X s_{0,1},$
$s_{0,1} Y \to e s_{5,1},\ s_{5,1} A' \to \dagger s_{5,1},\ s_{5,1} d \to bZ s_{0,1},$
$s_{0,1} Z \to s_{6,1},\ s_{6,1} A \to a s_{6,1}\}$;

$P_2 = \{s_{0,2} e \to d s_{0,2}\}$
$\cup \{s_{0,2} \alpha \to \alpha s_{0,2} \mid \alpha \in \{A, A', X, Y, Z, \dagger\}\}$.

The reader can check that $L(\Pi) = L$, that is, $L \in PTL_2$.

It remains to prove that $L \notin gsm(pRE)$. Assume the contrary, and take
a language $L_0 \in RE$, $L_0 \subseteq V^*$, and a gsm $g = (K, V, \{a, b\}, s_0, F, P)$ such that
$L = g(p(L_0))$. For each string $w = a b a^2 b \ldots b a^{2^m}$ in $L$ there is a string $w_0 \in p(L_0)$
such that $s_0 w_0 \Rightarrow^* w s_f$ with respect to the transitions in $P$ and $s_f \in F$.
Assume that $K$ contains $r$ states and denote

$k = \max\{|z| \mid sa \to zs' \in P\}$.

For $2^n > k \cdot (r + 2)$, when translating $w_0$ into $w$, for each block $a^{2^p}$ with
$p \ge n$ there is a state $s$ which is used at least twice. At least for such a state,
the corresponding cycle introduces at least one symbol $a$. Consequently, for each
such block there is $u \in Sub(w_0)$ and $s \in K$ such that

$su \Rightarrow^* a^t s$, for some $t \ge 1$.

Now, if $m$ is large enough, then the same state $s$ as above is used for two
different blocks $a^{2^p}, a^{2^q}$, $p \ne q$. Assume that $p < q$; the case when $p > q$ is
similar. That is, we can write $w_0 = w_0' u w_0'' v w_0'''$ and

$s_0 w_0' u w_0'' v w_0''' \Rightarrow^* x' s u w_0'' v w_0''' \Rightarrow^* x' a^t s w_0'' v w_0'''$
$\Rightarrow^* x' a^t x'' s v w_0''' \Rightarrow^* x' a^t x'' a^{t'} s w_0''' \Rightarrow^* x' a^t x'' a^{t'} x''' s_f$,

for some $t, t' \ge 1$, such that the string $x''$ contains at least one occurrence of $b$,
and $x' a^t x'' a^{t'} x''' = w$.

Clearly, also the string $z_0 = w_0' u v w_0'' w_0'''$ is in $p(L_0)$, because of the
permutation closure, and also the translation

$s_0 w_0' u v w_0'' w_0''' \Rightarrow^* x' a^t a^{t'} x'' x''' s_f$

is possible in $g$. The obtained string is of the form

$z = a b a^2 b \ldots b a^{2^p + t'} b \ldots b a^{2^q - t'} b \ldots b a^{2^m}$.

Such a string is not in $L$, a contradiction. Consequently, we do not have
$L \in gsm(pRE)$, and this concludes the proof.

The previous theorem has a series of interesting consequences. 

On the one-letter alphabet the permutation closure of a language is equal to the
language, so $RE^{one} \subseteq PTL^{one}$. The inclusion $PTL \subseteq RE$ follows from the
Church-Turing thesis (or can be proved in a direct, constructive, manner). Consequently,
we get:

Corollary 3. $RE^{one} = PTL^{one}$.

If we start the construction in the proof of Theorem 2 from a matrix grammar
without appearance checking, then component $\gamma_2$ is useless, which implies the
next result:

Corollary 4. $gsm(pMAT) \subseteq PTL_1$.

Consider the language $D_4 = \{w \in \{a_1, a_2, b_1, b_2\}^* \mid |w|_{a_i} = |w|_{b_i}, i = 1, 2\}$. It
is a generator of the family of context-free languages, $CF$, hence each context-free
language $L$ can be written in the form $L = \gamma(D_4)$, for a gsm $\gamma$. The language $D_4$
is context-free and permutation closed, therefore it belongs to $pMAT$. Because
$gsm(pMAT) \subseteq PTL_1$, we obtain

Corollary 5. $CF \subseteq PTL_1$.

We do not know whether or not the inclusion $RE \subseteq PTL_2$ (or $RE \subseteq PTL$)
holds. For strictly bounded languages, such a relation is true.

Corollary 6. $RE^{bound} = PTL^{bound}$.

Proof. Consider a language $L \in RE$, $L \subseteq a_1^* \ldots a_n^*$, for some $a_1, \ldots, a_n \in V$,
mutually different. From Theorem 2 it follows that $p(L) \in PTL_2$. We can write
$L = p(L) \cap a_1^* \ldots a_n^*$. An intersection with a regular language can be realized by a
gsm; again from Theorem 2 we get $L \in PTL_2$. Therefore, $RE^{bound} \subseteq PTL_2^{bound}$.

The converse inclusion follows from the Church-Turing thesis (or can be directly
proved).
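
The identity $L = p(L) \cap a_1^* \ldots a_n^*$ used in the proof can be checked on finite examples with the following sketch. The function name `bounded_part` is hypothetical and the letters are assumed to be distinct single characters.

```python
import re
from itertools import permutations


def bounded_part(language, letters):
    """Intersect the permutation closure of a finite language with a_1* ... a_n*."""
    # Regular expression a_1* a_2* ... a_n* for the given ordered letters.
    pattern = re.compile("".join(re.escape(a) + "*" for a in letters))
    closure = {"".join(p) for x in language for p in permutations(x)}
    return {y for y in closure if pattern.fullmatch(y)}
```

For a strictly bounded language the intersection returns exactly the original strings, because among all permutations of a string only one respects the order $a_1, \ldots, a_n$ of the letters.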

6 On the State Complexity of PT Systems 

The component $\gamma_1$ of the PT system in the proof of Theorem 2 has a number
of states which depends on the gsm $\gamma$ and the starting matrix grammar $G$. We
do not see a way to avoid the dependence on $\gamma$. However, if we do not look for
gsm images of permutation closures of languages in $RE$, then we can avoid the
dependence on the starting grammar $G$: the hierarchy on the maximal number
of states in the components of PT systems which generate languages in $pRE$
collapses at the second level:

Theorem 7. For each language $L \in pRE$ there is a PT system $\Pi$ such that
$L = L(\Pi)$ and each component of $\Pi$ has at most two states.

Proof. Starting from a matrix grammar $G = (N, T, S, M, C)$ in the binary normal
form, with $k$ matrices of the form $m_i : (X_i \to \alpha_i, A_i \to x_i)$, $1 \le i \le k$, and
$n$ matrices of the form $m_j : (X_j \to Y_j, A_j \to \dagger)$, $k+1 \le j \le k+n$, we construct
a PT system

$\Pi = (V, T, w_0, \gamma_1, \ldots, \gamma_{k+n}, \gamma_{k+n+1}, \gamma_{k+n+2})$,
$V = N_1 \cup N_2 \cup T \cup \{c, d, e, \dagger\} \cup \{\bar{a} \mid a \in T\}$,
$w_0 = XAcc$, for $(S \to XA) \in M$, $X \in N_1$, $A \in N_2$,

with the following components:

$\gamma_i = (\{s_{0,i}, s_{1,i}\}, V, s_{0,i}, P_i)$, for $1 \le i \le k$, with
$P_i = \{s_{0,i} X_i \to s_{1,i},\ s_{1,i} A_i \to \alpha_i \bar{x}_i s_{0,i},\ s_{1,i} c \to \dagger s_{1,i}\}$,

$\gamma_j = (\{s_{0,j}, s_{1,j}\}, V, s_{0,j}, P_j)$, for $k+1 \le j \le k+n$, with
$P_j = \{s_{0,j} X_j \to e s_{1,j},\ s_{1,j} d \to Y_j s_{0,j},\ s_{1,j} A_j \to \dagger s_{1,j}\}$,

$\gamma_{k+n+1} = (\{s_{0,k+n+1}\}, V, s_{0,k+n+1}, P_{k+n+1})$,
$P_{k+n+1} = \{s_{0,k+n+1} e \to d s_{0,k+n+1}\}$
$\cup \{s_{0,k+n+1} \alpha \to \alpha s_{0,k+n+1} \mid \alpha \in N_1 \cup N_2 \cup \{\dagger\}\}$,

$\gamma_{k+n+2} = (\{s_{0,k+n+2}, s_{1,k+n+2}\}, V, s_{0,k+n+2}, P_{k+n+2})$,
$P_{k+n+2} = \{s_{0,k+n+2} \bar{a} \to \bar{a} s_{0,k+n+2},\ s_{0,k+n+2} \bar{a} \to a s_{0,k+n+2} \mid a \in T\}$
$\cup \{s_{0,k+n+2} c \to s_{1,k+n+2}\}$
$\cup \{s_{1,k+n+2} \alpha \to \alpha s_{1,k+n+2} \mid \alpha \in N_1 \cup N_2 \cup \{\bar{a} \mid a \in T\} \cup \{\dagger\}\}$.

This system works in a way similar to that in the proof of Theorem 2. The
components $\gamma_i$, $1 \le i \le k$, simulate the matrices in $M$ which do not involve rules
used in the appearance checking mode. The components $\gamma_j$, $k+1 \le j \le k+n$,
simulate the matrices which contain rules used in the appearance checking mode.
Note that at each moment only one of these components can work, because only
one occurrence of a symbol from $N_1$ is present in the current multiset; moreover,
the use of a rule $X_i \to Y_i$, for any $i$, is completed only when the simulation of
the matrix in which this rule appears is completed (the symbol $Y_i$ is introduced
by a rule which returns to the initial state of the component $\gamma_i$). The component
$\gamma_{k+n+1}$ is used, as in the proof of Theorem 2, for ensuring the correct simulation
of the matrices which contain rules used in the appearance checking manner.

The component $\gamma_{k+n+2}$ is used for permuting the terminal symbols, at the
end of a computation, in such a way as to obtain all permutations of strings in $L(G)$
(we can wait in state $s_{0,k+n+2}$ as long as we need). Moreover, this component
checks whether or not the derivation is terminal: if any nonterminal of $G$ is
present in the configuration, then we can cycle in state $s_{1,k+n+2}$.

In conclusion, $L(\Pi) = p(L(G))$.

The number of components of the system constructed in the proof of Theorem
7 depends on the starting grammar.

In all the results from this and the previous section, the length of the string
$x$ in rules $sa \to xs'$ of the components of the PT systems we have used can
be bounded by two: start from a matrix grammar in the binary normal form
having the string $z$ in matrices $(X \to \alpha, A \to z)$ of length at most two (this
can be arranged - see ^]). One can see from the previous constructions that the 
obtained PT system has the desired property. 

7 Closure Properties 

A way to estimate the size of a family of languages is to consider its closure 
properties. From this point of view, the family PTL seems to be rather large: 
Theorem 8. The family $PTL$ is a full AFL.

Proof. It is easy to see that the family PTL is closed under arbitrary gsm map- 
pings; this implies the closure under arbitrary morphisms, intersection with reg- 
ular languages, and inverse morphisms. 

Union. Consider two PT systems $\Pi_i = (V_i, T_i, w_{0,i}, \gamma_{1,i}, \ldots, \gamma_{n_i,i})$, $i = 1, 2$.
Without loss of generality we may assume that the states used by the components
of $\Pi_1$ are different from those used by the components of $\Pi_2$ and that also $V_1 - T_1$
is disjoint from $V_2 - T_2$. We construct the system

$\Pi = (V_1 \cup V_2 \cup \{d, d_1, d_2\}, T_1 \cup T_2, d w_{0,1} w_{0,2}, \gamma_0, \gamma'_{1,1}, \ldots, \gamma'_{n_1,1}, \gamma'_{1,2}, \ldots, \gamma'_{n_2,2})$,

with

$\gamma_0 = (\{s_0\}, V_1 \cup V_2 \cup \{d, d_1, d_2\}, s_0, \{s_0 d \to d_1^{n_1} s_0,\ s_0 d \to d_2^{n_2} s_0\})$,

$\gamma'_{i,j} = (K_{i,j} \cup \{s'_{0,i,j}\}, V_1 \cup V_2 \cup \{d, d_1, d_2\}, s'_{0,i,j}, P_{i,j} \cup \{s'_{0,i,j} d_j \to s_{0,i,j}\})$,

for all $1 \le i \le n_j$, $j = 1, 2$.

One can easily see that we first work in the new component, $\gamma_0$, introducing
either $n_1$ occurrences of $d_1$ or $n_2$ occurrences of $d_2$. In this way, all components
of $\Pi_1$ or all components of $\Pi_2$ pass simultaneously to their initial states. From
now on, these components work exactly as they do in the initial system.
Because we have only one occurrence of $d$, $\gamma_0$ can work only once, hence only
the components of one of $\Pi_1, \Pi_2$ are activated. Consequently, we get
$L(\Pi) = L(\Pi_1) \cup L(\Pi_2)$.

Concatenation. Start from two systems $\Pi_i$, $i = 1, 2$, as above, with disjoint
sets of states and sets of non-terminal symbols, and construct a new system as
follows. Instead of a formal (highly cumbersome) construction, we indicate it in
Figure 2 and describe it informally.

As one can see in Figure 2, we have two new components, $\gamma_0$ and $\gamma'_0$, and
modified variants of all the components of $\Pi_1$ and $\Pi_2$. In particular, for each
$\gamma_{i,1}$, $1 \le i \le n_1$, we consider $\gamma'_{i,1}$, which "contains" $\gamma_{i,1}$ as well as a modified
copy of $\gamma_{i,1}$ with all states in the form $\bar{s}$. From each state $s$ in the copy of $\gamma_{i,1}$
to the corresponding state $\bar{s}$ in the modified copy of $\gamma_{i,1}$ we have a transition,
via a rule $s d_2 \to \bar{s}$. Moreover, for each move $sa \to xs'$ from $\gamma_{i,1}$, in the modified
copy we introduce the rule $\bar{s}a \to Xx\bar{s}'$.

The initial multiset is again $d w_{0,1} w_{0,2}$. At the first step, only the new
component $\gamma_0$ can work; it introduces $n_1$ copies of $d_1$, which make possible the
activation of all components of $\Pi_1$. These components (their copies from $\gamma'_{i,1}$)
reach their initial states and work as they work in $\Pi_1$. During this time, $\gamma_0$ can
stay in state $s_1$, using the rule $s_1 d \to d s_1$, while all other components of the
system ($\gamma'_0$ and $\gamma'_{i,2}$, for $1 \le i \le n_2$) are doing nothing, because they do not have
symbols to process.

At any moment, $\gamma_0$ can introduce $n_1$ copies of $d_2$ and pass to state $s_2$. This is
the moment when we want to finish the work of $\Pi_1$, to check whether or not this is
done correctly (whether or not we have a halting configuration from the point of
view of $\Pi_1$), and to pass to also simulate $\Pi_2$. This is ensured by the "controller"
component $\gamma'_0$. After having introduced $n_1$ copies of $d_2$, all these copies must
be used at the next step by the components $\gamma'_{i,1}$, $1 \le i \le n_1$, otherwise the
component $\gamma'_0$ can use such a symbol and enters a cycle, hence
the computation will never finish. A symbol $d_2$ can be used by a component $\gamma'_{i,1}$
by the rule $s d_2 \to \bar{s}$, where $s$ is the current state of $\gamma_{i,1}$ reached in $\gamma'_{i,1}$ and $\bar{s}$ is
a copy of it. Note that rules of the form $s d_2 \to \bar{s}$ are introduced for all states
$s$, but only one per component can be used, because we are in a given state of
each component.




Fig. 2. The construction for concatenation. 

If we are in a halting configuration of $\gamma_{1,i}$ (it is possible that such a
configuration has been reached at a previous step and we have just waited for the
symbol $d_2$ to be introduced), then we also move to a halting configuration of
the copy of $\gamma_{1,i}$ which uses barred states. If this is not the case, that is, at
least one further transition can be performed in $\gamma_{1,i}$, then this is also possible in
the copy of $\gamma_{1,i}$ now activated in $\gamma'_{1,i}$. Because each rule of this copy introduces
an occurrence of the symbol $X$, the computation is again lost: the "controller"
component will use this symbol $X$, working forever.

In conclusion, when the state $s_2$ is reached in $\gamma_0$, we can continue the
computation without entering a cycle if and only if a halting configuration was
reached in $\Pi_1$.

The component $\gamma_0$ now introduces $n_2$ copies of $d_3$, which activate the
components $\gamma'_{2,i}$, $1 \le i \le n_2$; in this way, we continue working as in $\Pi_2$, hence we
obtain the concatenation of the languages $L(\Pi_1)$ and $L(\Pi_2)$.

Kleene closure. Consider a PT system $\Pi = (V, T, w_0, \gamma_1, \ldots, \gamma_n)$. We proceed
as above, indicating the construction by a picture (Figure 3) and then
discussing it informally.

The new component $\gamma_0$ controls the iteration of using the system $\Pi$ (in the
new form, where the components $\gamma_i$ were modified to $\gamma'_i$, $1 \le i \le n$, in a way
similar to that in the proof of the closure under concatenation), and $\gamma'_0$ is again
the "controller" of the correct termination of a computation in $\Pi$ before starting
another computation.



Fig. 3. The construction for Kleene +.



The initial multiset is $d\,w_0$. At the first step we can work only in $\gamma_0$, where $n$
occurrences of $d_1$ are introduced. Now, each $\gamma'_i$ can start working. We enter the
initial states of each $\gamma_i$ and then we work as in $\gamma_i$ (at this time $\gamma_0$ stays in state
$s_1$ and $\gamma'_0$ stays in its initial state). At any time, $\gamma_0$ can introduce the symbol $d_2$
(again, $n$ copies). The copies of $d_2$ should be used for passing from the current
states of each $\gamma_i$ to the barred version of that state in the modified copy of $\gamma_i$
included in $\gamma'_i$ (otherwise the computation will never stop, because $\gamma'_0$ cycles in
$s'_1$). As in the case of concatenation, if the computation in $\Pi$ is not completed,
then the symbol $X$ is introduced and the computation will continue forever in
$\gamma'_0$. At this time, $\gamma_0$ passes to state $s_3$. At the next step, each component of the
system returns to its initial state, by rules of the form $s d_3 \to s_{0,j}$, made active
by the introduction of $n$ copies of $d_3$ by $\gamma_0$. At the same step, $\gamma_0$ also returns to
its initial state.

In this way, we can get the concatenation of any number of strings in $L(\Pi)$.
After producing any number of strings in $L(\Pi)$ (maybe only one), we can pass
to state $s_5$ of $\gamma_0$, which ends the computation.

In conclusion, we produce $L(\Pi)^+$, which concludes the proof of the closure
under Kleene + and the proof of the theorem, too.



8 Final Remarks 

We have here introduced a class of computing models - called PT systems - 
which belongs both to Natural Computing area (Computing with Membranes, 
0 , 0 , etc.) and to Multiset Processing ( 0 , 0 , 0 , etc.): several finite automata 
with outputs (gsm’s) swim in a space where a multiset of symbols is present and 
they process these symbols in a parallel manner. We prove that such machines are 
rather powerful: PT systems with two components (and no bound on the number 
of states of each component) can generate all (and more than all) gsm images 
of permutation closures of recursively enumerable languages, while PT systems 
with an unbounded number of components, each of them having only two states, 
can generate all permutation closures of recursively enumerable languages. 

We do not know whether or not these two parameters, the number of states 
and the number of components, induce a doubly infinite hierarchy of languages. 

Several other problems remain to be investigated. For instance, we have con- 
sidered here non-deterministic gsm’s. What about using only deterministic com- 
ponents in our systems? Actually, we have here two types of non-determinism, 
one at the level of components (using non-deterministic gsm’s) and one at the 
level of the whole system: in a given configuration, the component which takes a 
copy of a symbol and processes it is non-deterministically chosen among those
which can do it. For instance, if we have $n$ copies of the symbol $a$ and $n + 1$
components can take this symbol at that moment, only n of them will work on 
a; the remaining component will either wait, or will use a symbol different from 
a, if this is possible. 

A way to diminish the non-determinism at the level of the system is to 
consider a priority relation among components: in each moment, the components 
are enabled in the decreasing order of their priority. 

However, even a total ordering of components does not remove completely 
the non-determinism: if in a given state of a gsm we can read both a and b, and 
these symbols are present in the current multiset, then we may choose one of the 
two possible steps (note that this does not appear when gsm’s translate strings, 
because in that case at each moment only one symbol is scanned by the read
head).

On the other hand, systems which are completely deterministic (at each mo- 
ment, only one next configuration is possible) do not seem to be of much interest: 
they can proceed along a unique computation, which either stops (hence the
generated language is empty or finite), or continues forever (hence the language is
empty).

The study of determinism in PT systems, at various levels, deserves further
investigation.



Abstract. In this paper output space compaction for sequential circuits
is considered for the first time. Based on simple estimates for the prob- 
abilities of the existence of sensitized paths from the signal lines to the 
circuit outputs, optimal output partitions can be determined without 
fault simulation. The outputs are partitioned in such a way that internal 
stuck-at faults influence at most one of the outputs of a group with high 
probability. The proposed method is primarily developed for concurrent 
checking. On average with less than 4 compacted groups of outputs an 
error detection probability of 98% can be achieved. As the experimental 
results show, the method is also effectively applicable in pseudo-random 
test mode. On average for three groups of compacted outputs there is no 
reduction of the fault coverage for a pseudo-random off-line test. Since 
the proposed algorithm is of linear complexity with respect to the number 
of circuit lines and of quadratic complexity with respect to the number 
of primary circuit outputs large automata can be efficiently processed. 



1 Introduction 

As the complexity of VLSI continues to increase, the number of inputs and outputs
of automata which are implemented on the IC as sequential circuits is increasing
accordingly. For circuits with a large number of outputs, methods of output space
compaction are of growing interest. These methods make it possible to decrease
the number of observed outputs and therefore to reduce the necessary hardware
overhead for concurrent checking and testing.

Until now different methods for output space compaction were considered for 
combinational circuits only. The results obtained in PI, PI, m, PI, m , 0 are 
applicable in test mode. Output space compaction for concurrent checking for 
combinational circuits was investigated in m, PI, mu. In this paper for the first 
time output space compaction for synchronous sequential circuits for concurrent 
checking is considered. The proposed method is an extension of the structural 
method for output space compaction for combinational circuits described in 0 .

Fig. 1. Linear output compaction for concurrent checking



2 Linear Output Compaction for Concurrent Checking 

Linear output space compaction for concurrent checking is illustrated in Fig.
1. The circuitry of Fig. 1 consists of the monitored circuit or the circuit under
check (CUC) with $n$ primary outputs, a coder and a two-rail checker (TRC). The
primary outputs are divided into $k$ disjoint groups. The outputs of every group
are compacted by an XOR-tree into the $k$ compacted outputs. Each compacted
output represents the parity of the corresponding group. The coder generates
the $k$ inverted compacted outputs. The outputs of the coder are compared with
the compacted outputs of the CUC by the (self-checking) two-rail checker TRC.
An error signal of the two-rail checker indicates a fault of the CUC, the coder,
an XOR-tree, or the (self-checking) two-rail checker TRC.
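
The parity argument can be illustrated with a small Python sketch. It is only a model under stated assumptions: the coder and the two-rail checker are replaced by a direct comparison of the fault-free and the faulty compacted responses, and the names `compact_outputs` and `error_detected` are hypothetical.

```python
from functools import reduce
from operator import xor


def compact_outputs(outputs, groups):
    """Return one parity bit per group (XOR of the output bits in that group)."""
    return [reduce(xor, (outputs[i] for i in group), 0) for group in groups]


def error_detected(good, faulty, groups):
    """True if some compacted output differs between the two responses.

    Assumption: comparing against the fault-free response stands in for
    the coder / two-rail checker pair of the described scheme.
    """
    return compact_outputs(good, groups) != compact_outputs(faulty, groups)


good = [1, 0, 1, 1]
groups = [[0, 1], [2, 3]]            # two disjoint groups of outputs

single_error = [0, 0, 1, 1]          # output 0 flipped
double_same_group = [0, 1, 1, 1]     # outputs 0 and 1 flipped (same group)
double_diff_groups = [0, 0, 0, 1]    # outputs 0 and 2 flipped (different groups)
```

A single erroneous output always changes the parity of its group, while two erroneous outputs inside the same group cancel each other. This is the reason why outputs that tend to be erroneous together should be placed in different groups.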

For systems with a very large number of outputs, linear output space com- 
paction, which is widely used in fault tolerant system design, can be used to 
simplify the method of duplication and comparison. Instead of comparing all 
the outputs of the duplicated circuits using a huge comparator, only a small 
number of properly compacted outputs of the duplicated systems need to be 
compared. Nearly the same fault coverage can be achieved. 

3 Error Propagation in Combinational Circuits

In this section we briefly describe how the probabilities that errors of the circuit 
lines of a combinational circuit are propagated to each of the different circuit 
outputs can be simply approximated. The described algorithm is of linear com- 
plexity with respect to the number of circuit lines. The linear complexity is of 
great importance for an efficient processing of large circuits. 

We consider in this section a combinational circuit $C$ with $m$ inputs
$x_1, \ldots, x_m$ and $n$ outputs $y_1, \ldots, y_n$ which is given as a netlist of AND-, OR-,
NAND-, NOR-, XOR-, XNOR-gates and INVERTERS. We denote an error at
a signal line $s$ of $C$ by $e(s)$ and the set of all signal lines of $C$ by $L_C$.

As a simplification we assume that the values of all signal lines $s \in L_C$
(including the input lines of $C$) are randomly chosen and equally distributed with
the probabilities $p(s=1) = p(s=0) = 1/2$. Under this assumption we determine for
every circuit line $s$ and for every circuit output $y$ a simple approximation $p(s \leadsto y)$
of the probability that there is a sensitized path from $s$ to $y$. This probability is
an estimate of the probability that an error $e(s)$ at a circuit line $s$ results in an
erroneous circuit output $y$.

Now we explain how the corresponding probabilities for the existence of
sensitized paths can be approximated. At first we consider a single gate $G$ with an
erroneous input value. This erroneous value is propagated to the output of the
gate if the values of all the other input lines of $G$ are non-controlling values.
If we assume that the value of each signal line $s$ is randomly chosen with the
probabilities $p(s=1) = p(s=0) = 1/2$, then the probability that the error is
propagated to the output of an AND-, OR-, NAND-, or NOR-gate with $k$ input lines
is $p(G) = 1/2^{k-1}$. Since both the values 0 and 1 are non-controlling values of
XOR- and XNOR-gates we have $p(G) = 1$ for these gates. For an INVERTER
we also have $p(G) = 1$.
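
As a small illustration of these gate rules, the following Python sketch returns the propagation probability of a single gate. The function name and the gate-type strings are assumptions made for the example; exact fractions are used to avoid rounding.

```python
from fractions import Fraction


def gate_propagation_probability(gate_type, num_inputs):
    """Probability that an error on one input reaches the gate output,
    assuming every other input is 0 or 1 with probability 1/2."""
    gate_type = gate_type.upper()
    if gate_type in {"AND", "OR", "NAND", "NOR"}:
        # All other (num_inputs - 1) inputs must carry the non-controlling value.
        return Fraction(1, 2 ** (num_inputs - 1))
    if gate_type in {"XOR", "XNOR", "NOT"}:
        # Every input value is non-controlling; an inverter has no side inputs.
        return Fraction(1)
    raise ValueError(f"unknown gate type: {gate_type}")
```

For example, a 3-input NAND gate propagates an error with probability 1/4, whereas an XOR gate of any width propagates it with probability 1.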

For a combinational circuit $C$ with $n$ outputs, an $n$-dimensional vector
$p_s = (p(s \leadsto y_1), \ldots, p(s \leadsto y_n))$ is assigned to every circuit line $s$. The
components $p(s \leadsto y_1), \ldots, p(s \leadsto y_n)$ are the approximated probabilities that
paths from the line $s$ to the circuit outputs $y_1, \ldots, y_n$ are sensitized.

These vectors are computed by passing through the circuit from the outputs 
to the inputs in reverse topological order and under the assumption that the 
probability of an error to be propagated by a gate G is p{G). For more details 
see 0. 



4 Error Propagation in Sequential Circuits

We consider now a sequential circuit $S$ consisting of a combinational part $C_S$,
$p$ flip-flops $R_1, \ldots, R_p$, $m$ primary inputs $x_1, \ldots, x_m$, and $n$ primary outputs
$y_1, \ldots, y_n$. $S$ is supposed to be given as a netlist of gates and the set of all signal
lines is denoted by $L_S$.

An error $e(s)$ of a signal line can be propagated to the primary outputs either
directly during the same clock cycle or with a delay of $N$, $N \ge 1$, clock cycles,
passing the flip-flop elements of the sequential circuit $N$ times. As the delay $N$
increases, the lengths of the corresponding paths increase and the probability
that one of these paths is sensitized decreases. Here the length of a path is
determined by the number of AND-, NAND-, OR- or NOR-gates on this path.
The approximated probability that a path is sensitized decreases exponentially
with its length and therefore also with the corresponding delay $N$. In accordance
with these considerations the proposed algorithm for the determination
of the approximated probabilities restricts the considered paths to paths with
a maximal delay $N_{\max}$. As the experimental results show, the value $N_{\max} = 8$
guarantees the necessary accuracy.

Fig. 2. Iterative array of time frames (inputs $x_{0,1}, \ldots, x_{8,m}$; outputs $y_{0,1}, \ldots, y_{8,n}$)



The sequential circuit $S$ is modeled as an iterative array $A_S$ of $N_{\max} + 1 = 8 + 1 =
9$ time frames $C_0, C_1, \ldots, C_8$ as described, for instance, in p. The corresponding
iterative array of 9 time frames is shown in Fig. 2. The combinational parts $C_i$,
$0 \le i \le 8$, are identical to the combinational part $C_S$ of the sequential circuit.
The inputs and the outputs of the time frame $C_i$ are denoted by $x_{i,1}, \ldots, x_{i,m}$
and $y_{i,1}, \ldots, y_{i,n}$, respectively. A signal line $s$ of the original sequential circuit $S$
corresponds to 9 signal lines $s_0, s_1, \ldots, s_8$ within the time frames $C_0, C_1, \ldots, C_8$,
respectively.

The approximated probability $p(s_0 \leadsto y_{i,j})$ that a sensitized path exists from
line $s_0$ of $C_0$ to the $j$-th output of the $i$-th time frame $C_i$ is computed by applying
an extension of the algorithm described in Section 3.

5 Output Space-Compaction 

In this section we describe how the groups of outputs which are compacted by 
XOR-trees can be efficiently determined. If exactly two of the circuit outputs 
are simultaneously erroneous this error can only be detected if the erroneous 
outputs are in different groups. Therefore, if the probability that two outputs y 
and y' are simultaneously erroneous for a given fault is high then these outputs 
should be in different groups of compacted outputs. 

For the sequential circuit $S$ the approximate probability $p_S(s \leadsto y_j)$ that
an error $e(s)$ is propagated from the signal line $s$ to the primary output $y_j$ is
computed by

$$p_S(s \leadsto y_j) = \sum_{i=0}^{N_{\max}} p(s_0 \leadsto y_{i,j}).$$

The signal line $s_0$ in the time frame $C_0$ of $A_S$ corresponds to the line $s$ of the
sequential circuit $S$, and $p(s_0 \leadsto y_{i,j})$ is the probability that an error $e(s_0)$ in $C_0$
is propagated to the $j$-th output of the $i$-th time frame $C_i$, $0 \le i \le N_{\max}$. As we
have already pointed out, we chose $N_{\max} = 8$.

Now we compute the approximation $P(y, y')$, $y \ne y'$, of the probability that
a randomly chosen signal line of $S$ is erroneous and this error is detectable at
both primary outputs $y$ and $y'$ of $S$. As a simplification we assume that for all
signal lines $s$ of $S$ the probabilities $p_S(s \leadsto y)$ and $p_S(s \leadsto y')$ for $y \ne y'$ are
independent. Then we obtain

$$P(y, y') = \frac{1}{|L_S|} \sum_{s \in L_S} p_S(s \leadsto y) \cdot p_S(s \leadsto y').$$
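
The pairwise estimate can be sketched in Python as follows. This is a minimal illustration, assuming that the per-line probabilities are already available as a dictionary (`p_s`, a hypothetical name) and that the sum over all signal lines is normalised by the number of lines, as befits a randomly chosen line.

```python
def pairwise_error_probability(p_s):
    """Estimate P(y, y') for all pairs of outputs.

    p_s maps each signal line to a list whose j-th entry approximates the
    probability that an error on that line reaches output j.
    Assumption: the sum over lines is divided by the number of lines.
    """
    lines = list(p_s)
    n = len(p_s[lines[0]])
    P = [[0.0] * n for _ in range(n)]
    for a in range(n):
        for b in range(a + 1, n):
            total = sum(p_s[s][a] * p_s[s][b] for s in lines)
            P[a][b] = P[b][a] = total / len(lines)
    return P


# Two signal lines, three outputs (illustrative numbers only).
example = {"s1": [1.0, 0.5, 0.0], "s2": [0.5, 0.5, 1.0]}
P_example = pairwise_error_probability(example)
```

The resulting matrix is symmetric, and its diagonal is left at zero because only pairs of distinct outputs are of interest.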



Using these probabilities $P(y, y')$ we now describe a heuristic for the
determination of a partition $Q_k$ of the set $Y$ into $k$ disjoint groups $G_1, \ldots, G_k$ with
$G_1 \cup \ldots \cup G_k = Y$, $G_i \cap G_j = \emptyset$ for $i \ne j$.

The outputs of every group $G_i$, $1 \le i \le k$, are compacted by an XOR-tree into
the compacted group output $z_i$. For a given $k$ the partition $Q_k$ is computed in
such a way that the sum

$$P(Q_k) = \sum_{G_i \in Q_k} \; \sum_{\substack{y_s, y_t \in G_i \\ s < t}} P(y_s, y_t)$$

is minimized.

If the probability $P(y_s, y_t)$ is small for all pairs $y_s, y_t$ of outputs which are
both elements of the same group $G_i$ then the value $P(Q_k)$ of the partition $Q_k$ is
also small.

For grouping the outputs, we first assign to every output $y_j \in Y$ a connection
weight $W(y_j)$,

$$W(y_j) = \sum_{y \in Y \setminus \{y_j\}} P(y_j, y).$$

Then we sort the circuit outputs in descending order of their connection weights.
We assign the first $k$ outputs to the $k$ different groups $G_1, \ldots, G_k$. The remaining
outputs are assigned according to the following rule: if $l$ outputs, $k \le l < n$, are
already assigned to the groups $G_1, \ldots, G_k$ then we assign the next output $y$ to
that group $G_j \in \{G_1, \ldots, G_k\}$ for which the sum

$$W_j = \sum_{y' \in G_j} P(y, y')$$

is minimal.
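
The grouping heuristic described above can be sketched in Python as follows. The function name `partition_outputs` and the matrix representation of $P$ are assumptions made for the example; ties are broken in favour of the group with the smallest index, which the text does not specify.

```python
def partition_outputs(P, k):
    """Greedy partition of outputs 0..n-1 into k groups.

    P is a symmetric matrix with P[a][b] approximating the probability that
    outputs a and b are erroneous at the same time.
    """
    n = len(P)
    # Connection weight of each output: sum of P over all other outputs.
    weights = [sum(P[j][y] for y in range(n) if y != j) for j in range(n)]
    # Descending order of connection weight (stable for equal weights).
    order = sorted(range(n), key=lambda j: weights[j], reverse=True)
    # The k heaviest outputs seed the k groups.
    groups = [[y] for y in order[:k]]
    for y in order[k:]:
        # Cost of putting y into a group: sum of P(y, y') over its members.
        costs = [sum(P[y][member] for member in group) for group in groups]
        groups[costs.index(min(costs))].append(y)
    return groups


# Outputs 0 and 1 often fail together, as do outputs 2 and 3.
P_demo = [
    [0.0,   0.5,   0.125, 0.125],
    [0.5,   0.0,   0.125, 0.125],
    [0.125, 0.125, 0.0,   0.25],
    [0.125, 0.125, 0.25,  0.0],
]
groups_demo = partition_outputs(P_demo, 2)
```

In this small example the strongly correlated pairs (0, 1) and (2, 3) end up in different groups. Each assignment inspects at most $n$ matrix entries, which is consistent with the quadratic dependence on the number of outputs.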

The proposed algorithm for output space-compaction is linear with respect 
to the number of signal lines or the number of gates and quadratic with respect 
to the number of circuit outputs. Thus, the proposed algorithm can be applied to 
large sequential circuits. For a given number k of groups of compacted outputs 
optimal partitions can be determined without fault simulation. 

6 Experimental Results 

Experimental results are derived for 20 benchmark circuits of the 1989
International Symposium on Circuits and Systems. For each of the benchmark circuits
and for $k = 2, \ldots, 8$ the partitions $Q_k$ of their outputs are determined as described
in the previous section. The necessary CPU time on a SUN SPARC 5 is less than
one second if the number of gates is limited to 1000. The largest investigated
circuit is s15850.1 with 9785 gates and 534 flip-flops, and the corresponding run
time is approximately 8 seconds. These times include the computation of the
probabilities $p(s \leadsto y)$ for $N = 8$ for all internal lines $s$ and for all circuit outputs
$y$. The trivial partition $Q_1$, i.e. the compaction of all circuit outputs by a single
XOR-tree into the parity of the circuit outputs, is also considered.

Experimental results are obtained with respect to single stuck-at faults,
transient faults, and intermittent faults for concurrent checking in on-line mode. With
3.7 groups on average, at least 98% of the errors which are detected at the outputs
of $S$ are detected in the same clock cycle at the compacted outputs of $S_k$.

Although the method was developed for concurrent checking it can also suc- 
cessfully be applied for pseudo-random off-line testing. On average less than 3 
groups of compacted outputs guarantee that no reduction of the fault coverage 
with respect to the original circuit in a pseudo-random off-line test is obtained. 

Since the proposed algorithm is of linear complexity with respect to the
number of circuit lines and of quadratic complexity with respect to the number
of primary circuit outputs, large circuits can be efficiently processed.



Abstract. A locally threshold testable language $L$ is a language with
the property that for some nonnegative integers $k$ and $l$, whether or not
a word $u$ is in the language $L$ depends on (1) the prefix and suffix of the
word $u$ of length $k-1$ and (2) the set of intermediate substrings of length
$k$ of the word $u$ where the sets of substrings occurring at least $j$ times are
the same, for $j \le l$. For given $k$ and $l$ the language is called $l$-threshold
$k$-testable. A finite deterministic automaton is called $l$-threshold $k$-testable
if the automaton accepts an $l$-threshold $k$-testable language.

In this paper, the necessary and sufficient conditions for an automaton to 
be locally threshold testable are found. We introduce the first polynomial 
time algorithm to verify local threshold testability of the automaton 
based on this characterization. 

A new version of the polynomial time algorithm to verify local testability
is presented too.

Keywords: deterministic finite automaton, locally threshold testable, al- 
gorithm, semigroup 

AMS subject classification 68Q25, 68Q45, 68Q68, 20M07 



Introduction 

The concept of local testability was introduced by McNaughton and Papert m 
and by Brzozowski and Simon 0. Local testability can be considered as a special
case of local $l$-threshold testability for $l = 1$. Locally testable languages,
automata and semigroups have been investigated from different points of view 
(see m - jE], Id, Id: P5| - Id)- In PH], local testability was discussed in 
terms of ’’diameter-limited perceptrons” . Locally testable languages are a gen- 
eralization of the definite and reverse-definite languages, which can be found, 
for example, in m and Id- Some variations of the concept of local testability 
(strictly, strongly) obtained by changing or omitting prefixes and suffixes in the 
definition of the concept were studied in m, iHi, m, pni. 

Locally testable automata have a wide spectrum of applications. Regular lan- 
guages and picture languages can be described by a strictly locally testable lan- 
guages P, Id- Local automata (a kind of locally testable automata) are heavily 
used to construct transducers and coding schemes adapted to constrained
channels P|. Literal morphisms may be modelled by help of 2-testable languages |0|.
In Cl , locally testable languages are used in the study of DNA and informational
macromolecules in biology.

Kim, McNaughton and McCloskey (Cl, CD have found necessary and suf- 
ficient conditions of local testability and a polynomial time algorithm for local 
testability problem based on these conditions. The realization of the algorithm 
is described by Caron in jH]. A polynomial time algorithm for local testability 
problem for semigroups was presented in m- 

The locally threshold testable languages were introduced by Beauquier and 
Pin 0 . These languages generalize the concept of locally testable language and 
have been studied extensively in recent years (see |5|, [201, EI3)- The 

syntactic characterization of locally threshold testable languages one can find in 

Q: 

Given the syntactic semigroup $S$ of the language $L$, we form a graph $G(S)$
as follows. The vertices of $G(S)$ are the idempotents of $S$, and the edges from $e$
to $f$ are the elements of the form $esf$. A language $L$ is locally threshold testable
if and only if $S$ is aperiodic and for any two nodes $e, f$ and three edges $p, q, r$
such that $p$ and $q$ are edges from $e$ to $f$ and $r$ is an edge from $f$ to $e$ we have

$$prq = qrp.$$

Since only five elements of the semigroup $S$ are considered, there exists a
polynomial time algorithm of order $O(|S|^5)$ for the local threshold testability
problem in the case of semigroups. But the cardinality of the syntactic semigroup
of a locally threshold testable automaton is not polynomial in the number of 
its nodes ISl- This is why the study of the automaton and the state transition 
graph of the automaton is important from the practical point of view (see d, 
d) and we use here this approach. 

For the state transition graph $\Gamma$ of an automaton, we consider some
subgraphs of the cartesian products $\Gamma \times \Gamma$ and $\Gamma \times \Gamma \times \Gamma$. In this way, necessary and
sufficient conditions for a deterministic finite automaton to be locally threshold
testable are found. We present here an $O(n^5)$ time algorithm to verify local
threshold testability of the automaton based on this characterization.

Necessary and sufficient conditions of local testability from m are considered
in this paper in terms of reachability in the graph $\Gamma \times \Gamma$. A new version of the $O(n^2)$
time algorithm to verify local testability based on this approach is presented
too.

Notation and Definitions 

Let $\Sigma$ be an alphabet and let $\Sigma^+$ denote the free semigroup on $\Sigma$. If $w \in \Sigma^+$, let
$|w|$ denote the length of $w$. Let $k$ be a positive integer. Let $i_k(w)$ [$t_k(w)$] denote
the prefix [suffix] of $w$ of length $k$, or $w$ if $|w| < k$. Let $F_{k,j}(w)$ denote the set of
factors of $w$ of length $k$ with at least $j$ occurrences. A language $L$ [a semigroup
$S$] is called $l$-threshold $k$-testable if there is an alphabet $\Sigma$ [and a surjective
morphism $\phi : \Sigma^+ \to S$] such that for all $u, v \in \Sigma^+$, if $i_{k-1}(u) = i_{k-1}(v)$,
$t_{k-1}(u) = t_{k-1}(v)$ and $F_{k,j}(u) = F_{k,j}(v)$ for all $j \le l$, then either both $u$ and $v$
are in $L$ or neither is in $L$ [$u\phi = v\phi$].

An automaton is $l$-threshold $k$-testable if the automaton accepts an
$l$-threshold $k$-testable language [the syntactic semigroup of the automaton is
$l$-threshold $k$-testable].

A language $L$ [a semigroup $S$, an automaton $A$] is locally threshold
testable if it is $l$-threshold $k$-testable for some $k$ and $l$.

A semigroup without non-trivial subgroups is called aperiodic.

$|\Gamma|$ denotes the number of nodes of the graph $\Gamma$.

$\Gamma^i$ denotes the direct product of $i$ copies of the graph $\Gamma$.

A maximal strongly connected component of the graph will be denoted for
brevity as SCC; a finite deterministic automaton will be denoted as DFA.
A node from an SCC will be called for brevity an SCC-node.

If an edge $p \to q$ is labeled by $\sigma$ then let us denote the node $q$ as $p\sigma$.

We shall write $p \succeq q$ if the node $q$ is reachable from the node $p$ or $p = q$.
In the case $p \succeq q$ and $q \succeq p$ we write $p \sim q$ (that is, $p$ and $q$ belong to one
SCC or $p = q$).



1 The Necessary and Sufficient Conditions 

Let us formulate the result of Beauquier and Pin 0 in the following form: 

Theorem 1.1 A language $L$ is locally threshold testable if and only if the
syntactic semigroup $S$ of $L$ is aperiodic and for any two idempotents $e, f$ and
elements $a, u, b$ of $S$ we have

$$eafuebf = ebfueaf \qquad (1)$$

Let us recall the concept of implicit operation. The unary operation $x^\omega$
assigns to every element $x$ of a finite semigroup the unique idempotent in the
subsemigroup generated by $x$.
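
For a transformation semigroup the idempotent power can be computed by simply iterating powers. The sketch below is an illustration under stated assumptions: elements are represented as tuples mapping state `p` to `x[p]` (acting on the right, as in the notation $p\sigma$), and the names `compose` and `omega` are hypothetical.

```python
def compose(x, y):
    """Product xy of two transformations: first apply x, then y."""
    return tuple(y[image] for image in x)


def omega(x):
    """Return the unique idempotent among the powers x, x^2, x^3, ...

    The loop terminates because a finite semigroup always contains an
    idempotent power of each of its elements.
    """
    power = x
    while compose(power, power) != power:
        power = compose(power, x)
    return power


cycle = (1, 2, 0)       # a permutation of three states
collapse = (1, 2, 2)    # an aperiodic transformation
```

For the 3-cycle the idempotent power is the identity, which shows that the cycle generates a non-trivial group; for the second transformation the powers stabilise at a constant map.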

The set of locally threshold testable semigroups forms a pseudovariety of
semigroups. So Theorem 1.1 implies

Corollary 1.2 The pseudovariety of locally threshold testable semigroups
consists of aperiodic semigroups and satisfies the pseudoidentity

$$x^\omega y z^\omega u x^\omega t z^\omega = x^\omega t z^\omega u x^\omega y z^\omega.$$

Lemma 1.3 Let the node $(p, q)$ be an SCC-node of $\Gamma^2$ of a locally threshold
testable DFA with state transition graph $\Gamma$ and suppose that $p \sim q$.

Then $p = q$.

Proof. The transition semigroup $S$ of the automaton is finite and aperiodic.
Suppose that for some element $e \in S$ and for some states $q$ and $p$ from SCC
$X$ we have $qe = q$ and $pe = p$. In view of $qe^i = q$, $pe^i = p$ and finiteness of $S$
we can assume $e$ is an idempotent. In the SCC $X$ for some $a, b$ from $S$ we have
$pa = q$ and $qb = p$. Hence, $peae = q$, $qebe = p$. So $peaebe = p = p(eaebe)^i$
for any integer $i$. There exists a natural number $n$ such that in the aperiodic
semigroup $S$ we have $(eae)^n = (eae)^{n+1}$. From Theorem 1.1 it follows that for
the idempotent $e$, $eaeebe = ebeeae$. We have $p = peaebe = p(eaeebe)^n =
p(eae)^n(ebe)^n = p(eae)^{n+1}(ebe)^n = p(eae)^n(ebe)^n eae = peae = q$. So $p = q$. □

Theorem 1.4 For a DFA $A$ with state transition graph $\Gamma$ the following three
conditions are equivalent:

1) $A$ is locally threshold testable.

2) If the nodes $(p, q_1, r_1)$ and $(q, r, t, t_1)$ are SCC-nodes of $\Gamma^3$ and $\Gamma^4$,
correspondingly, and

$(q, r) \succeq (q_1, r_1)$, $(p, q_1) \succeq (r, t)$, $(p, r_1) \succeq (q, t_1)$ holds in $\Gamma^2$,
then $t = t_1$.

3) If the node $(u, v)$ is an SCC-node of the graph $\Gamma^2$ and $u \sim v$ then $u = v$.
If the nodes $(p, q_1, r_1)$, $(q, r, t)$, $(q, r, t_1)$ are SCC-nodes of the graph $\Gamma^3$
and

$(q, r) \succeq (q_1, r_1)$, $(p, q_1) \succeq (r, t)$, $(p, r_1) \succeq (q, t_1)$ hold in $\Gamma^2$,
then $t \sim t_1$.

Proof. 2) $\to$ 1):

Let us consider the nodes $zebfueaf$ and $zeafuebf$ where $z$ is an arbitrary
node of $\Gamma$, $a, u, b$ are arbitrary elements from the transition semigroup $S$ of the
automaton and $e, f$ are arbitrary idempotents from $S$. Let us denote

$ze = p$, $zebf = q$, $zeaf = r$, $zeafue = r_1$, $zebfue = q_1$, $zebfueaf = t$,
$zeafuebf = t_1$.

By condition 2), we have $t = t_1$, whence $zebfueaf = zeafuebf$. Thus, the
condition $eafuebf = ebfueaf$ (1) holds for the transition semigroup $S$. By
Theorem 1.1 the automaton is locally threshold testable.

1) $\to$ 3):

If the node $(u, v)$ belongs to some SCC of the graph $\Gamma^2$ and $u \sim v$ then by
Lemma 1.3 local threshold testability implies $u = v$.

The condition $eafuebf = ebfueaf$ ((1), Theorem 1.1) holds for the
transition semigroup $S$ of the automaton. Let us consider nodes $p, q, r, t, q_1, r_1, t_1$
satisfying the condition 3). Suppose

$(p, q_1, r_1)e = (p, q_1, r_1)$, $(q, r, t)f_2 = (q, r, t)$, $(q, r, t_1)f_1 = (q, r, t_1)$
for some idempotents $e, f_1, f_2 \in S$, and

$(p, q_1)a = (r, t)$, $(p, r_1)b = (q, t_1)$, $(q, r)u = (q_1, r_1)$
for some elements $a, b, u \in S$. Then $peaf_2 = peaf_1$ and $pebf_2 = pebf_1$.

We have $t_1 f_2 = peaf_1uebf_1f_2$. By Theorem 1.1 $pebf_jueaf_j = peaf_juebf_j$ for
$j = 1, 2$. So we have $t_1 f_2 = peaf_1uebf_1f_2 = pebf_1ueaf_1f_2$. In view of $pebf_2 =
pebf_1$ and $f_2 = f_2f_2$ we have $t_1 f_2 = pebf_2f_2ueaf_1f_2$. By Theorem 1.1 $t_1 f_2 =
pe(bf_2)f_2ue(af_1)f_2 = pe(af_1)f_2ue(bf_2)f_2$. Now in view of $peaf_2 = peaf_1$ let
us exclude $f_1$ and obtain $t_1 f_2 = peaf_2uebf_2 = t$. So $t_1 f_2 = t$. Analogously,
$tf_1 = t_1$.

Hence, $t_1 \sim t$. Thus 3) is a consequence of 1).

3) $\to$ 2):

Suppose that $(p, q_1, r_1)e = (p, q_1, r_1)$, $(q, r, t, t_1)f = (q, r, t, t_1)$, for some
idempotents $e, f$ from the transition semigroup $S$ of the automaton and
$(p, q_1)a = (r, t)$, $(p, r_1)b = (q, t_1)$, $(q, r)u = (q_1, r_1)$
for some elements $a, u, b \in S$. Therefore

$(p, q_1)eaf = (p, q_1)af = (r, t)$

$(p, r_1)ebf = (p, r_1)bf = (q, t_1)$

$(q, r)u = (q, r)fue = (q_1, r_1)$
for idempotents $e, f$ and elements $a, u, b \in S$.

For $f = f_1 = f_2$ from 3) we have $t \sim t_1$. Notice that $(t_1, t)f = (t_1, t)$. The
node $(t_1, t)$ belongs to some SCC of the graph $\Gamma^2$ and $t \sim t_1$, whence by
Lemma 1.3 $t = t_1$. □

Lemma 1.5 Let the nodes $(q, r, t_1)$ and $(q, r, t_2)$ be SCC-nodes of the graph
$\Gamma^3$ of a locally threshold testable DFA with state transition graph $\Gamma$. Suppose
that $(p, r_1) \succeq (q, t_1)$, $(p, r_1) \succeq (q, t_2)$ in the graph $\Gamma^2$ and $p \succeq r \succeq r_1$.

Then $t_1 \sim t_2$.




some idempotents $e, f_1, f_2$ from the transition semigroup $S$ of the automaton
and

$(p, r_1)b_1 = (q, t_1)$, $(p, r_1)b_2 = (q, t_2)$, $pa = r$, $ru = r_1$
for some elements $a, u, b_1, b_2 \in S$.

If $t_1 f_2 \sim t_2$ and $t_2 f_1 \sim t_1$ then $t_2 \sim t_1$ in spite of our assumption. Therefore
let us assume for instance that $t_1 \not\sim t_2 f_1$ (and so $t_1 \ne t_2 f_1$). This gives
us an opportunity to consider $t_2 f_1$ instead of $t_2$. So let us denote $t_2 = t_2 f_1$,
$f = f_1 = f_2$. Then $t_2 f = t_2$, $t_1 f = t_1$ and $t_1 \not\sim t_2$. Now

$peaf = paf = r$

$(p, r_1)eb_1f = (p, r_1)b_1f = (q, t_1)$

$(p, r_1)eb_2f = (p, r_1)b_2f = (q, t_2)$

$ru = rue = r_1$

So we have

Let us denote $q_1 = que$ and $t = q_1af$. Then
$(p, q_1, r_1)e = (p, q_1, r_1)$, $(q, r)ue = (q_1, r_1)$, $(q, r, t, t_i)f = (q, r, t, t_i)$.
So the node $(p, q_1, r_1)$ is an SCC-node of the graph $\Gamma^3$, the nodes $(q, r, t, t_i)$
are SCC-nodes of the graph $\Gamma^4$ for $i = 1, 2$ and we have $(q, r) \succeq (q_1, r_1)$,
$(p, q_1) \succeq (r, t)$ and $(p, r_1) \succeq (q, t_i)$ for $i = 1, 2$.

Therefore, by Theorem 1.4 (2), we have $t_1 = t$ and $t_2 = t$. Hence, $t_1 \sim t_2$,
contradiction. □



Definition 1.6 For any four nodes $p, q, r, r_1$ of the graph $\Gamma$ of a
DFA such that $p \succeq r \succeq r_1$, $p \succeq q$ and the nodes $(p, r_1)$,
$(q, r)$ are SCC-nodes, let $T_{SCC}(p, q, r, r_1)$ be the SCC of $\Gamma$ containing

$$T(p, q, r, r_1) := \{ t \mid (p, r_1) \succeq (q, t) \text{ and } (q, r, t) \text{ is an SCC-node} \}.$$

By Lemma 1.5, the SCC $T_{SCC}(p, q, r, r_1)$ of a locally threshold
testable DFA is well defined (but empty if the set $T(p, q, r, r_1)$ is empty).
Lemma 1.5 and Theorem 1.4 (3) imply the following theorem.



Theorem 1.7 A DFA $A$ with state transition graph $\Gamma$ is locally threshold
testable iff

1) for every SCC-node $(p, q)$ of $\Gamma^2$, $p \sim q$ implies $p = q$
and

2) for every five nodes $p, q, r, q_1, r_1$ of the graph $\Gamma$ such that

- the non-empty SCCs $T_{SCC}(p, q, r, r_1)$ and $T_{SCC}(p, r, q, q_1)$ exist,

- the node $(p, q_1, r_1)$ is an SCC-node of the graph $\Gamma^3$,

- $(q, r) \succeq (q_1, r_1)$ in $\Gamma^2$,

the equality $T_{SCC}(p, q, r, r_1) = T_{SCC}(p, r, q, q_1)$ holds.



2 Algorithm to Verify the Local Threshold Testability 

A linear depth-first search algorithm finding all SCC of the given directed graph 
(see H) |23| or USD will be used. 

2.1 To Check the Reachability on an Oriented Graph 

For a given node $q_0$, we consider a depth-first search from that node. At first
only $q_0$ is marked. Every edge is crossed two times. Given a node, the considered
path includes first the ingoing edges and then the outgoing edges. After crossing
an edge in the positive direction from the marked node $q$ to the node $r$ we mark
r too. The process is linear in the number of edges (see for details). 

The marked nodes are exactly the nodes that are reachable from $q_0$.
The procedure may be repeated for every node of the graph $G$.

The time of the algorithm for all pairs of nodes is $O(n^2)$.
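
The marking procedure can be illustrated by a short sketch. It is a simplification under assumptions made here, not the procedure of the paper: it follows outgoing edges only, uses an explicit stack, and the adjacency-list representation and the names `reachable_from` and `reachability_table` are hypothetical.

```python
def reachable_from(graph, start):
    """Return the set of nodes reachable from `start` (including `start`).

    `graph` maps every node to the list of its direct successors.
    """
    marked = {start}                 # at first only the start node is marked
    stack = [start]
    while stack:
        node = stack.pop()
        for succ in graph.get(node, []):
            if succ not in marked:   # crossing an edge to an unmarked node marks it
                marked.add(succ)
                stack.append(succ)
    return marked


def reachability_table(graph):
    """Repeat the search from every node to obtain the table for all pairs."""
    nodes = set(graph)
    for successors in graph.values():
        nodes.update(successors)
    return {node: reachable_from(graph, node) for node in nodes}
```

Each search visits every edge a bounded number of times, so with a number of edges linear in the number of nodes (a fixed alphabet) the table is filled in time quadratic in the number of nodes.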



2.2 To Verify Local Threshold Testability 

Let us find all SCC of the graphs $\Gamma$, $\Gamma^2$ and $\Gamma^3$ and mark all SCC-nodes ($O(n^3)$
time complexity).

Let us recognize the reachability on the graphs $\Gamma$ and $\Gamma^2$ and form the table
of reachability for all pairs of nodes of $\Gamma$ and of $\Gamma^2$. The time required for this step is $O(n^4)$.

Let us check the conditions of lemma El For every S'CC-node (p, q) (p ^ q) 
from $\Gamma^2$ let us check the condition $p \sim q$. A negative answer for every considered
node $(p, q)$ implies the validity of the condition. Otherwise the automaton
is not locally threshold testable. The time of the step is $O(n^2)$.

For every four nodes $p, q, r, r_1$ of the graph $\Gamma$, let us check the following
conditions (see Definition 16): $p \succeq r \succeq r_1$ and $p \succeq q$. In a positive case, let us form the SCC
$T_{SCC}(p, q, r, r_1)$ of all nodes $t \in \Gamma$ such that $(p, r_1) \succeq (q, t)$ and $(q, r, t)$ is an
SCC-node. In case the SCC $T_{SCC}$ is not well defined, the automaton is not
threshold testable. The time required for this step is $O(n^5)$.

For every five nodes $p, q, r, q_1, r_1$ from $\Gamma$ let us now check the second
condition of Theorem 17. If non-empty components $T_{SCC}(p, q, r, r_1)$ and
$T_{SCC}(p, r, q, q_1)$ exist, the node $(p, q_1, r_1)$ is an SCC-node of the graph $\Gamma^3$
and $(q, r) \succeq (q_1, r_1)$ in $\Gamma^2$, let us verify the equality $T_{SCC}(p, q, r, r_1) =
T_{SCC}(p, r, q, q_1)$. If the answer is negative then the automaton is not threshold
testable. A positive answer for all considered cases implies the validity of the
condition of the theorem. The time required for this step is $O(n^5)$.

The whole time of the algorithm to check the local threshold testability is
$O(n^5)$.

3 The Local Testability 

We present now necessary and sufficient conditions of local testability of Kim, 
McNaughton and McCloskey m m) in the following form: 

Theorem 31 ([14]) A DFA with state transition graph $\Gamma$ and transition semigroup
$S$ is locally testable iff the following two conditions hold:

1) For any SCC-node $(p, q)$ from $\Gamma^2$ such that $p \sim q$ we have $p = q$.

2) For any SCC-node $(p, q)$ from $\Gamma^2$ such that $p \succeq q$ and arbitrary element
$s$ from $S$ we have: $ps \succeq q$ is valid iff $qs \succeq q$.

The theorem implies 

Corollary 32 A DFA with state transition graph $\Gamma$ over alphabet $\Sigma$ is locally
testable iff the following two conditions hold:

1) For any SCC-node $(p, q)$ from $\Gamma^2$ such that $p \sim q$ we have $p = q$.

2) For any node $(r, s)$ and any SCC-node $(p, q)$ from $\Gamma^2$ such that $(p, q) \succeq
(r, s)$, $s \succeq q$ and for arbitrary $\sigma$ from $\Sigma$ we have: $r\sigma \succeq s$ is valid iff $s\sigma \succeq s$.

4 Algorithm to Verify the Local Testability 

In [14], a polynomial time algorithm for the local testability problem was considered.
Now we present another version of such an algorithm with the same time complexity.
We hope that it will be simpler.

Let us form a table of reachability on the graph $\Gamma$ ($O(n^2)$ time complexity).
Let us find $\Gamma^2$ and all SCC-nodes of $\Gamma^2$.

For every SCC-node $(p, q)$ ($p \neq q$) from $\Gamma^2$ let us check the condition $p \sim q$
($O(n^2)$ time complexity). If the condition holds then the automaton is not locally
testable dS3- 

Let us exclude all edges (p,q) — >■ (r,s) from the graph such that s ^ q 
and s ^ p. Then let us mark all nodes (p, q) of the graph such that for some 
$\sigma$ from $\Sigma$ exactly one of the two conditions $p\sigma \succeq q$ and $q\sigma \succeq q$ is valid. The
time required for this step is $O(n^2)$.

Then we add to the graph a new node $(0, 0)$ with edges from this node to every
SCC-node. Let us find the set of nodes reachable from the node $(0, 0)$ ($O(n^2)$
time complexity). The automaton is locally testable iff no marked node belongs
to the obtained set (Corollary 32).

The whole time of the algorithm to check the local testability is $O(n^2)$.
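
The first test of this procedure (an SCC-node $(p, q)$ of $\Gamma^2$ with $p \neq q$ and $p \sim q$) can be illustrated by a small sketch. It rests on assumptions made here: the DFA is a nested dictionary `delta[state][letter]` in which every state is a key, an SCC-node is taken to be a node lying on a cycle, and $p \sim q$ is taken to mean mutual reachability. The function names are hypothetical. For brevity the sketch runs a separate search per pair instead of using the reachability table, so it does not reach the time bound stated above.

```python
from itertools import product


def reach(successors, start):
    """Nodes reachable from `start` by a path with at least one edge."""
    seen, stack = set(), list(successors(start))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(successors(node))
    return seen


def violates_first_condition(delta):
    """True iff some node (p, q), p != q, lies on a cycle of the product
    graph while p and q belong to the same SCC of the transition graph."""
    def succ1(p):
        return set(delta[p].values())

    def succ2(pair):
        p, q = pair
        # An edge of the product graph applies the same letter to both parts.
        return {(delta[p][a], delta[q][a]) for a in delta[p] if a in delta[q]}

    plus = {p: reach(succ1, p) for p in delta}
    for p, q in product(delta, repeat=2):
        if p == q:
            continue
        same_scc = q in plus[p] and p in plus[q]
        if same_scc and (p, q) in reach(succ2, (p, q)):
            return True
    return False
```

For instance, the two-state cycle on the single letter `a` (accepting words of even length) fails the test, in line with the fact that this language is not locally testable.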

Acknowledgments. I would like to express my gratitude to Stuart Margolis
for posing the problem and for helpful suggestions on improving the style of the
paper, and to the referees for important and useful remarks.
Abstract. In this paper, we present a taxonomy of algorithms for constructing
minimal acyclic deterministic finite automata (MADFAs). Such
automata represent finite languages and are therefore useful in applications
such as storing words for spell-checking, computer and biological
virus searching, text indexing and XML tag lookup. In such applications,
the automata can grow extremely large (with more than $10^6$ states) and
are difficult to store without compression or minimization.

The taxonomization method arrives at all of the known algorithms, and 
some which are likely new ones (though proper attribution is not at- 
tempted, since the algorithms are usually of commercial value and some 
secrecy frequently surrounds the identities of the original authors). 



1 Introduction 

In this paper, we present a taxonomy of algorithms for constructing minimal 
acyclic deterministic finite automata (MADFAs). MADFAs represent finite lan- 
guages and are therefore useful in applications such as storing words for spell- 
checking, computer and biological virus searching, text indexing and XML tag 
lookup. In such applications, the automata can grow extremely large and are 
difficult to store without compression or minimization. Whereas compression 
is considered in various other papers (and is usually specific to data-structure 
choices), here we focus on minimization. 

We apply the following technique for taxonomizing the algorithms: 

1. At the root of the taxonomy is a simple, if inefficient, algorithm whose cor- 
rectness is either easy to prove or is simply postulated. 

2. New algorithms are derived by adding an algorithm detail — a correctness- 
preserving transformation of the algorithm or elaboration of program state- 
ments. This yields an algorithm which is still correct. 

3. By carefully choosing the details, all of the well-known algorithms appear in 
the taxonomy. Creative invention of new details also yields new algorithms. 
This technique was applied on a large scale in my Ph.D. dissertation |H|.
The dissertation also contains taxonomies of algorithms for constructing finite 
automata from regular expressions and for minimizing deterministic finite au- 
tomata. Here, we assume some familiarity with the common algorithms for au- 
tomata construction and minimization. 

1.1 Related Work 

The work presented here is significantly different from the taxonomies presented 
in the dissertation, since specializing for MADFAs can yield particularly efficient 
algorithms. 

Some of the algorithms included in this taxonomy were previously presented, 
for example, in Turkey 0 (the present author, Jan Daciuk and Richard Watson), 
at WIA’98 p] (the present author) and in P] (Stoyan Mihov). Other algorithms 
for the MADFA construction problem have typically been kept as trade secrets 
(due to their commercial success in applications such as spell-checking) . As such, 
many of them have likely been known for some number of years, but tracing the 
original authors will be difficult and proper attributions are not attempted — 
though I would welcome hearing from researchers who performed some of the 
original work. 

1.2 Preliminaries 

We make the following definitions: 

— FA is the set of all finite automata. 

— DFA is the set of all deterministic FAs. 

— ADFA is the set of all acyclic DFAs. 

— MADFA is the set of all minimal ADFAs.

More precise definitions are not required here. In this paper, we are primarily 
interested in algorithms which build MADFAs. The algorithms are readily ex- 
tended to work with acyclic deterministic transducers, though such an extension 
is not considered. 

For any $M \in \mathrm{FA}$, $|M|$ is the number of states in $M$ and $\mathcal{L}(M)$ is the language
(set of words) accepted by $M$. The primary definition of minimality of an $M \in
\mathrm{DFA}$ is: $(\forall M' \in \mathrm{DFA} : \mathcal{L}(M') = \mathcal{L}(M) : |M| \le |M'|)$.

Predicate $Min(M)$ holds when $M \in \mathrm{DFA}$ and the above definition of minimality
both hold. A useful DFA property is: $\mathcal{L}(M)$ is finite $\land$ $Min(M) \equiv M \in
\mathrm{MADFA}$.

All of the algorithms presented here are in the guarded command language, 
a type of pseudo-code — see |2| . 

2 A First Algorithm 

In this section, we present our first algorithm and outline some ways in which
to proceed. The problem is as follows: given alphabet $\Gamma$ and some finite set of
words $W \subset \Gamma^*$ (the containment is proper, since $\Gamma^*$ is infinite), compute some
$M \in \mathrm{ADFA}$ such that $\mathcal{L}(M) = W \land Min(M)$. In the algorithms that follow,
we give $M$ the type FA, which is the most general type in the containment
$\mathrm{MADFA} \subset \mathrm{ADFA} \subset \mathrm{DFA} \subset \mathrm{FA}$. At any point in the program, the variable $M$
may actually contain a MADFA.

Given this, our first algorithm (where $S$ is a program statement still to be
derived) is:

Algorithm 2.1:

{ $W \subset \Gamma^* \land W$ is finite }
$S$
{ $\mathcal{L}(M) = W \land Min(M)$ }

□

In order to make some progress, we consider a split of statement $S$ to
accomplish the postcondition in two steps:

Algorithm 2.2:

{ $W \subset \Gamma^* \land W$ is finite }
$S_0$;
{ $\mathcal{L}(M) = f(W) \land X(M)$ }
$S_1$
{ $\mathcal{L}(M) = W \land Min(M)$ }

□

There are, of course, infinitely many choices for function $f$ and predicate $X$,
some of which are not interesting. For example, if we define $f(W) = \emptyset$, then
after $S_0$, we will have accomplished virtually nothing (since the automaton will
accept the empty language), regardless of how we define $X$. For this reason, we
restrict ourselves to the following three possibilities for $f$:

1. $f(W) = W$ (the identity function).

2. $f(W) = W^R$ (the reversal of $W$).

3. $f(W) = \neg W$ (the complement of $W$: $\neg W = \Gamma^* - W$).

Other choices are possible. These were chosen because: 

— in some sense, statement $S_0$ will accomplish a reasonable amount of work,

— it is reasonably easy to convert an $f(W)$-accepting DFA to a $W$-accepting
one, and

— with these choices, we can arrive at many of the known algorithms.

We consider each choice of $f$ in the following sections.

3 $f(W) = W$

We can now turn to choices for predicate $X$. Clearly, any predicate can be
chosen, but we restrict our choices to (at least) arrive at the known algorithms.
As a first option, consider strengthenings of $Min$, that is: $X(M) \Rightarrow Min(M)$.
In that case, we choose $S_1$ to be the skip statement (which does nothing, since
$Min(M)$ already holds), and we are left with a statement $S_0$ which is as difficult
to derive as our first algorithm. For this reason, we abandon strengthenings of
$Min$ (including the possibility $X(M) \equiv Min(M)$).

Instead, we turn our attention to weakenings¹ of $Min$. We begin with the
extreme of these weakenings: true.



3.1 $X(M) \equiv true$

By writing our choices of $f$ and $X$ in full, our program becomes:

Algorithm 3.1:

{ $W \subset \Gamma^* \land W$ is finite }
$S_0$;
{ $\mathcal{L}(M) = W$ }
$S_1$
{ $\mathcal{L}(M) = W \land Min(M)$ }

□

For $S_0$, we can use any algorithm which yields an automaton $M$ such that
$\mathcal{L}(M) = W$. In Section 7 we separately consider algorithms for doing this.

If the expansion of $S_0$ is an algorithm yielding a DFA, then for $S_1$ we can
use any of the minimization algorithms in jSl Chapter 7] or the one given by 
P]. If M delivered by Sq is not deterministic, we can either use Brzozowski’s 
minimization algorithm (see Chapter 7] ) or first apply the subset construction 
(to determinize M) and then any one of the other minimization algorithms. 

Clearly the extensive choices for $S_0$ and $S_1$ yield an entire subtree of the
taxonomy — and therefore an entire family of algorithms. 



3.2 $X(M) \equiv M \in \mathrm{DFA}$

This yields:

Algorithm 3.2:

{ $W \subset \Gamma^* \land W$ is finite }
$S_0$;
{ $\mathcal{L}(M) = W \land M \in \mathrm{DFA}$ }
$S_1$
{ $\mathcal{L}(M) = W \land Min(M)$ }

□

¹ We could equally choose some $X$ which is not related by implication to $Min$; this has
not been explored and is a topic for future research, since it may lead to interesting
algorithms.

The choices for $S_0$ are as in Section 3.1 and are discussed in Section 7. Similarly, for $S_1$, there
are a number of choices (see 0 Chapter 7] and — though we are already 
certain that M is a DFA and no determinization step is required. 
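
For an acyclic DFA, one possible choice of $S_1$ merges states with equal right languages, working bottom-up from the final states. The following is a hedged sketch under assumptions made here, not the algorithm of any of the cited works: the transition function is partial and stored as `delta[state][letter]`, every state can reach a final state, and the name `minimize_acyclic` is hypothetical.

```python
def minimize_acyclic(start, delta, finals):
    """Merge equivalent states of an acyclic DFA.

    Returns (start, delta, finals) of the reduced automaton; only states
    reachable from `start` are kept.
    """
    register = {}   # signature -> representative state
    rep = {}        # original state -> its representative

    def visit(state):
        if state in rep:
            return rep[state]
        # Two states are equivalent iff they agree on finality and have the
        # same letters leading to the same (already merged) targets.
        signature = (
            state in finals,
            tuple(sorted((a, visit(t)) for a, t in delta.get(state, {}).items())),
        )
        rep[state] = register.setdefault(signature, state)
        return rep[state]

    new_start = visit(start)
    kept = set(register.values())
    new_delta = {r: {a: rep[t] for a, t in delta.get(r, {}).items()} for r in kept}
    new_finals = {r for r in kept if r in finals}
    return new_start, new_delta, new_finals
```

The recursion terminates because the automaton is acyclic, and its depth is bounded by the length of the longest word.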

3.3 $X(M)$ as Partial Minimality

In 0, a partial minimality predicate is introduced and it is shown to be a 
weakening of Min. This yields the following algorithm: 

Algorithm 3.3:

{ $W \subset \Gamma^* \land W$ is finite }
$S_0$;
{ $\mathcal{L}(M) = W \land X(M)$ }
$S_1$
{ $\mathcal{L}(M) = W \land Min(M)$ }

□

In the original paper, $S_0$ is derived as an algorithm which constructs $M$ as a
partially minimal DFA, while $S_1$ is derived as a ‘cleanup’ phase to finalize the
minimization. The interested reader is referred to the presentation in that paper.

4 $f(W) = W^R$

It is no accident that reversal was used in /: it is known to be related to min- 
imality via Brzozowski’s minimization algorithm jS] (in that presentation, the 
history of the algorithm is given, along with full correctness arguments for each 
part of the algorithm). Brzozowski’s algorithm, for some $M \in \mathrm{FA}$ (not necessarily
a DFA), is:

Algorithm 4.1:

$M' := reverse(M)$;
$M' := determinize(M')$;
{ $\mathcal{L}(M') = \mathcal{L}(M)^R \land M' \in \mathrm{DFA}$ }
$M' := reverse(M')$;
$M' := determinize(M')$
{ $\mathcal{L}(M') = \mathcal{L}(M) \land Min(M')$ }

□

Thanks to this, the most obvious choice for predicate $X$ is $X(M) \equiv M \in \mathrm{DFA}$.
In that case, our program is

Algorithm 4.2:

{ $W \subset \Gamma^* \land W$ is finite }
$S_0$;
{ $\mathcal{L}(M) = W^R \land M \in \mathrm{DFA}$ }
$S_1$
{ $\mathcal{L}(M) = W \land Min(M)$ }

□

Using Brzozowski’s algorithm, we expand $S_1$ in the above program:

Algorithm 4.3:

{ $W \subset \Gamma^* \land W$ is finite }
$S_0$;
{ $\mathcal{L}(M) = W^R \land M \in \mathrm{DFA}$ }
$M := reverse(M)$;
$M := determinize(M)$
{ $\mathcal{L}(M) = W \land Min(M)$ }

□

For $S_0$, there are a number of algorithms for building a DFA from $W$ (see Section 7),
and we can trivially modify them to deal with $W^R$.
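
Algorithm 4.3 can be tried out with a short sketch. Here $S_0$ is taken to be a trie built from the reversed words, which is one possible choice and an assumption of this example. Automata are represented as triples (start states, final states, labelled edges), and all function names are hypothetical. The subset construction keeps the transition function partial by never creating the empty subset.

```python
from collections import defaultdict


def trie_of_reversed_words(words):
    """S_0: a trie-shaped DFA accepting the reversal of every word."""
    edges, finals, children = set(), set(), {0: {}}
    for word in words:
        state = 0
        for letter in reversed(word):
            if letter not in children[state]:
                new_state = len(children)          # fresh state number
                children[new_state] = {}
                children[state][letter] = new_state
                edges.add((state, letter, new_state))
            state = children[state][letter]
        finals.add(state)
    return {0}, finals, edges


def reverse(automaton):
    starts, finals, edges = automaton
    return set(finals), set(starts), {(q, a, p) for (p, a, q) in edges}


def determinize(automaton):
    """Subset construction restricted to reachable, non-empty subsets."""
    starts, finals, edges = automaton
    step = defaultdict(set)
    for p, a, q in edges:
        step[p, a].add(q)
    letters = {a for _, a, _ in edges}
    start = frozenset(starts)
    seen, todo, new_edges = {start}, [start], set()
    while todo:
        subset = todo.pop()
        for a in letters:
            target = frozenset(q for p in subset for q in step.get((p, a), ()))
            if target:
                new_edges.add((subset, a, target))
                if target not in seen:
                    seen.add(target)
                    todo.append(target)
    new_finals = {s for s in seen if s & finals}
    return {start}, new_finals, new_edges


def build_minimal(words):
    """Algorithm 4.3: S_0, then reverse, then determinize."""
    return determinize(reverse(trie_of_reversed_words(words)))


def accepts(dfa, word):
    starts, finals, edges = dfa
    table = {(p, a): q for p, a, q in edges}
    (state,) = starts
    for letter in word:
        state = table.get((state, letter))
        if state is None:
            return False
    return state in finals


def states_of(dfa):
    starts, finals, edges = dfa
    return set(starts) | {p for p, _, _ in edges} | {q for _, _, q in edges}
```

For the words tap, taps, top and tops the reversed trie has twelve states, while the automaton returned by `build_minimal` has five.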



5 $f(W) = \neg W$

It is known that DFA minimality is preserved under negation of the DFA, at least
using most reasonable definitions of a negating mapping². Armed with this, we
choose $X(M) \equiv Min(M)$. This yields

Algorithm 5.1:

{ $W \subset \Gamma^* \land W$ is finite }
$S_0$;
{ $\mathcal{L}(M) = \neg W \land Min(M)$ }
$S_1$
{ $\mathcal{L}(M) = W \land Min(M)$ }

□

² We assume a negating mapping which is able to work on DFAs with partial transition
functions, since a total transition function would (in the case of non-empty DFAs)
be cyclic.

With the preservation of minimality under negation, we select $S_1$ to be the
negation function, giving

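Algorithm 5.2:

{ $W \subset \Gamma^* \land W$ is finite }
$S_0$;
{ $\mathcal{L}(M) = \neg W \land Min(M)$ }
$M := negate(M)$
{ $\mathcal{L}(M) = W \land Min(M)$ }

□

Statement $S_0$ can be further split, giving

Algorithm 5.3:

{ $W \subset \Gamma^* \land W$ is finite }
$S_0'$;
{ $\mathcal{L}(M) = \neg W$ }
$S_0''$;
{ $\mathcal{L}(M) = \neg W \land Min(M)$ }
$M := negate(M)$
{ $\mathcal{L}(M) = W \land Min(M)$ }

□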

In this case, $S_0'$ builds $M$ corresponding to $\neg W$; this can be accomplished by
first building $M$ corresponding to $W$ and then applying negate. (There may, of
course, be other algorithms still to be derived.) Subsequently, $S_0''$ corresponds
to some minimization algorithm, for example, those given in The running 
time advantages of including this negation step are not yet clear. 

6 Min(M) as an Invariant 

In the previous sections, we have considered algorithms with two parts: $S_0$ and
$S_1$. We return to Algorithm 2.1, the root of the taxonomy, to obtain the
following algorithm, where we use $Min(M)$ as a repetition invariant:

Algorithm 6.1:

{ $W \subset \Gamma^* \land W$ is finite }
$M := empty\_DFA$;
$Done, To\_do := \emptyset, W$;
{ invariant: $\mathcal{L}(M) = Done \land Min(M) \land Done \cup To\_do = W \land Done \cap To\_do = \emptyset$
  variant: $|To\_do|$ }
do $To\_do \neq \emptyset \rightarrow$
    $S_2$; { choose some word in $To\_do$ }
    { $w \in To\_do$ }
    $Done, To\_do := Done \cup \{w\}, To\_do - \{w\}$;
    $S_3$
od
{ $Done = W$ }
{ $\mathcal{L}(M) = W \land Min(M)$ }

□

We now consider possible versions of statements $S_2$ and $S_3$. There are two
straightforward ways to proceed with $S_2$:

1. Lexicographically order the words in W. Obtaining the elements of W in 
lexicographic order is easily implemented. To implement statement S3, a 
derivation was recently given in [P, and the interested reader is referred to 
that paper. 

2. Unordered choice from W. This is the easiest way in which to select an 
element of W. As above, an implementation of S3 was also derived in [IJ, 
and it is not considered in detail here. 

These two algorithms are the only two fully incremental MADFA construction 
algorithms known. Both of them have running time which is linear in the size of 
W (as does Revuz’s algorithm ^ — an algorithm related to the two mentioned 
here). 
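
One well-known way to realise option 1 keeps a register of unique states and minimizes the path of the previous word as soon as the next word diverges from it. The sketch below follows that idea. It is not the derivation from the cited paper, and all names are hypothetical. Note that it weakens the invariant of Algorithm 6.1: between iterations the automaton is minimal except for the states on the path of the most recently added word, and full minimality is only restored by the final call `minimize(0)`.

```python
class State:
    def __init__(self):
        self.final = False
        self.edges = {}              # letter -> State

    def signature(self):
        # Children are compared by identity; when a signature is computed,
        # every child is already a registered (unique) state.
        return (self.final,
                tuple(sorted((a, id(t)) for a, t in self.edges.items())))


def build_from_sorted(words):
    root = State()
    register = {}                    # signature -> unique representative
    unchecked = []                   # (parent, letter, child) on the last path
    previous = ""

    def minimize(down_to):
        # Replace states on the last path, deepest first, by equivalent
        # registered states where such states exist.
        while len(unchecked) > down_to:
            parent, letter, child = unchecked.pop()
            parent.edges[letter] = register.setdefault(child.signature(), child)

    for word in words:
        assert word >= previous, "words must arrive in lexicographic order"
        common = 0
        while (common < min(len(word), len(previous))
               and word[common] == previous[common]):
            common += 1
        minimize(common)
        node = unchecked[-1][2] if unchecked else root
        for letter in word[common:]:
            child = State()
            node.edges[letter] = child
            unchecked.append((node, letter, child))
            node = child
        node.final = True
        previous = word
    minimize(0)
    return root


def accepts(root, word):
    node = root
    for letter in word:
        node = node.edges.get(letter)
        if node is None:
            return False
    return node.final


def count_states(root):
    seen, todo = {id(root)}, [root]
    while todo:
        node = todo.pop()
        for child in node.edges.values():
            if id(child) not in seen:
                seen.add(id(child))
                todo.append(child)
    return len(seen)
```

A typical call is `build_from_sorted(sorted(W))`; an unsorted input is rejected by the assertion rather than silently producing a non-minimal automaton.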



7 Constructing a (Not Necessarily Minimal) Finite 
Automaton 

In this section, we briefly discuss some algorithms for constructing a finite au- 
tomaton from W: 

1. One obvious (though not very efficient) method is to first build a regular
expression from $W$ (as $w_0 + w_1 + \cdots$ for words $w_i \in W$) and then

use one of the general construction algorithms given in pi Chapter 6]. This 
algorithm has not yet been benchmarked, although it is likely to be slow due 
to the generality. It is possible, however, that some improvements could be 
made based upon the simple (star-free) structure of the regular expressions. 

2. For each $w \in W$, we build a simple linear finite automaton with $|w| + 1$
states (the transitions are respectively labeled with the letters from $w$). The
final (nondeterministic) finite automaton is built by combining all of the
individual automata and adding a new start state with $\varepsilon$-transitions to the
individual start states. As with the above algorithm, this one is very likely
to be slow.

3. For each $w \in W$, we apply the standard algorithm for adding a word to a
trie-structured DFA. Such algorithms are presented in most algorithm texts. 
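
The third method can be sketched in a few lines. The representation (a dictionary `delta[state][letter]` with integer states and start state 0) and the names `build_trie` and `accepts` are assumptions made for this illustration.

```python
def build_trie(words):
    """Build a trie-structured DFA accepting exactly the given words."""
    delta = {0: {}}                  # state -> {letter: state}; 0 is the start state
    finals = set()
    for word in words:
        state = 0
        for letter in word:
            if letter not in delta[state]:
                new_state = len(delta)          # fresh state number
                delta[new_state] = {}
                delta[state][letter] = new_state
            state = delta[state][letter]
        finals.add(state)
    return delta, finals


def accepts(delta, finals, word):
    state = 0
    for letter in word:
        if letter not in delta[state]:
            return False
        state = delta[state][letter]
    return state in finals
```

The resulting automaton is deterministic and acyclic but in general not minimal, since words sharing a suffix do not share states.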
8 Conclusions

We have presented a straightforward taxonomy of algorithms for constructing 
minimal acyclic deterministic finite automata. The taxonomy begins with an 
algorithm which has unelaborated statements, postulated to be correct. Each of 
the subsequent algorithms is derived by applying correctness-preserving trans- 
formations to the initial algorithm. In the course of constructing the taxonomy, 
all of the pre-existing algorithms were derived — including some of the most re- 
cently presented incremental algorithms. Furthermore, the taxonomy elaborated 
on two other groups of algorithms: 

— Many of the original, and efficient, algorithms were previously only known 
as trade secrets in industry. 

— Some of the intermediate algorithms contain dead-ends or have derivation 
possibilities which are unexplored. 

There are a number of areas of future research: 

— Although most of the algorithm details are intuitively correct, the full cor- 
rectness arguments must be provided. 

— A number of unexplored directions were highlighted in the taxonomy. Some 
of these may, in fact, lead to new algorithms of practical importance. 

— The theoretical and benchmarked running times of the algorithms have not
been adequately explored and are not given in this paper. Exploring them will allow
the careful choice of an algorithm to apply in practice.